We live in a world where the way people do business is being converted into digital form. This phenomenon, called digital transformation, is propelling enormous amounts of data into digital repositories at unprecedented rates. The data comes in various shapes, sizes, qualities, and values. Indexing and search capabilities can contribute to an organization’s success by allowing it to quickly and accurately locate specific information in large data repositories.
Search is often underutilized by application developers. Developers typically use the term "search" to mean searching for a string, but the advanced search capabilities within the Macrometa Global Data Network (GDN) enable far more sophisticated searches. One goal of this blog series is to help application developers use the full set of benefits provided by GDN Search.
This is the first blog in a three-part series about seven search patterns frequently used with the Macrometa GDN. In it, we introduce Macrometa Search capabilities and go over exact value matching. The second blog will cover prefix matching, full-text token search, and phrase and proximity search. The series will conclude with range queries, faceted search, and geospatial search. Let’s begin with a detailed classification and then walk through the first example.
Powering search with the Macrometa GDN
The Macrometa GDN is a multi-model, streaming NoSQL database with integrated pub/sub and stream data processing capabilities. When it comes to search, GDN brings two main advantages over other data stores.
First, the GDN is equipped with a state-of-the-art indexing and search facility that lets users perform sophisticated search operations on multi-model data storage, including key-value pairs, documents, and graphs. Users are not required to reformat the data and upload it to a separate system. Instead, the data persisted on the GDN can be used directly for search. The complexity of the search can vary from simple queries, such as exact value matching and range queries, up to complex operations, such as faceted search and geospatial search. Each of these categories can be classified as a search pattern.
Second, with the Macrometa GDN the search indexes are updated globally whenever their underlying data is updated. This allows the GDN to answer search queries quickly and with high accuracy.
Exploring the dataset sample for search patterns
The example in this blog uses a dataset of London hotel reviews obtained from Kaggle. The refined dataset has 10,000 reviews collected by crawling a leading travel portal. Each data item in the dataset follows the schema shown in Figure 1.
The dataset can be downloaded from this URL. Before importing it, first create a fabric named Hotels in your GDN federation and then create a document collection called hotel_reviews within that fabric. After downloading the JSON file, replace <DATA> in Listing 1 with the content of the file. The dataset can then be imported into your GDN federation by issuing the curl command from a terminal, as shown in Listing 1.
Listing 1: How to copy the content from the JSON file into the CURL command
Listing 2: Importing a single data item into the GDN federation via invoking the REST API of a GDN node
In Listing 2 we import only a single review, made for the hotel named The Savoy, by specifying its JSON content. The values <HOST> and <BEARER_TOKEN> refer to the host name of the GDN node and the bearer token, respectively. The bearer token can be obtained through the REST API of the GDN node.
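Pasting a large JSON file into a terminal command by hand is error-prone, especially because review text often contains apostrophes that break shell quoting. The <DATA> substitution can be scripted instead. The following is a minimal Python sketch: the command template, the `<OTHER_OPTIONS>` placeholder, and the field names are assumptions made for illustration and are not taken from Listing 1.

```python
import json
import shlex


def fill_data_placeholder(command_template, reviews):
    """Replace <DATA> with the reviews serialized as shell-safe JSON."""
    payload = json.dumps(reviews, ensure_ascii=False)
    # shlex.quote escapes apostrophes so the shell sees one single argument.
    return command_template.replace("<DATA>", shlex.quote(payload))


# Hypothetical template: <OTHER_OPTIONS> stands in for the URL and headers
# of the real command, which this sketch does not try to reproduce.
template = "curl <OTHER_OPTIONS> --data <DATA>"

# Hypothetical field names; the real schema is the one shown in Figure 1.
reviews = [{"Property_Name": "The Savoy", "Review_Text": "Didn't disappoint"}]

command = fill_data_placeholder(template, reviews)
print(command)
```

In practice the `reviews` list would come from `json.load()` on the downloaded file rather than from a literal.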
Understanding GDN Search terminology
Views can be queried using C8QL via its SEARCH operation. Search expressions are composed of the fields to search, the search terms, comparison and logical operators, as well as other GDN Search functions.
The input values are processed by constructs called Analyzers. These can normalize strings and tokenize text into words, which enables various possibilities to search for values later on.
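To make the role of an Analyzer concrete, here is a small Python sketch that contrasts an identity-style analyzer with a simple text analyzer. It only mimics the idea described above; it is not the GDN implementation, and both function names are hypothetical.

```python
import re


def identity_analyzer(value):
    """Return the input unchanged as one token, which supports exact matching."""
    return [value]


def simple_text_analyzer(value):
    """Normalize to lower case and split the input into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


name = "The Savoy"
print(identity_analyzer(name))     # ['The Savoy']
print(simple_text_analyzer(name))  # ['the', 'savoy']
```

With the identity-style analyzer the whole string is the only searchable term, so a query must supply exactly that string. With the tokenizing analyzer, each word becomes searchable on its own.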
The search output can be sorted either by similarity ranking, which returns the best matches first using popular scoring algorithms (TF-IDF, Okapi BM25), or by user-defined relevance boosting and dynamic score calculation.
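As an illustration of similarity ranking, the sketch below scores a few tokenized documents with the textbook Okapi BM25 formula. The parameter values `k1=1.2` and `b=0.75` are commonly used defaults and the sample documents are invented; neither is claimed to reflect what the GDN uses internally.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Length normalization penalizes documents longer than average.
            norm = tf + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "great location great staff".split(),
    "small room but great breakfast".split(),
    "noisy street and slow service".split(),
]
scores = bm25_scores(["great"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [0, 1, 2]
```

The first document ranks highest because it mentions the query term twice and is shorter than average. The third document never mentions the term and scores zero.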
Implementing exact value matching
The most basic search pattern matches the presence of an exact value. The exact value can be a string, a number, a number range, or a boolean. Here we index and search strings using the identity Analyzer. The view used for exact value matching can be defined as shown in Listing 3.
Listing 3: Defining a view for exact string matching
After defining the search view, we can get the list of reviews made for the Rhodes Hotel as follows.
Listing 4: Match exact hotel name
Note that because identity is the default Analyzer, it is not necessary to set the Analyzer context with the ANALYZER() function here. Exactly the same results can therefore be obtained by executing the query shown in Listing 5.
Listing 5: Match exact hotel name using the default identity Analyzer
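Separately from the C8QL listings, the behavior of exact value matching can be modeled in a few lines of Python. The sketch builds a lookup from each exact field value to the documents that hold it, which is roughly what indexing a field with the identity Analyzer achieves. The field names `Property_Name` and `Review_Rating` and the sample records are assumptions for illustration.

```python
from collections import defaultdict


def build_exact_index(docs, field):
    """Map each exact field value to the positions of the documents holding it."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        if field in doc:
            index[doc[field]].append(pos)
    return index


def search_exact(index, docs, value):
    """Return the documents whose indexed field equals `value` exactly."""
    return [docs[pos] for pos in index.get(value, [])]


# Hypothetical sample records, not rows from the actual dataset.
reviews = [
    {"Property_Name": "Rhodes Hotel", "Review_Rating": 5},
    {"Property_Name": "The Savoy", "Review_Rating": 4},
    {"Property_Name": "Rhodes Hotel", "Review_Rating": 3},
]
index = build_exact_index(reviews, "Property_Name")
print(len(search_exact(index, reviews, "Rhodes Hotel")))  # 2
print(len(search_exact(index, reviews, "rhodes hotel")))  # 0, the case differs
```

The second lookup returns nothing because exact matching is case-sensitive: the stored value and the search term must be identical strings.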
Matching with negations
You can use negation to search for items that do not exactly match a specified value. In this scenario, inequality is checked with the != operator, which returns everything from the view index except the documents that match the specified value.
Listing 6: Match records that do not have the specified property value
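The semantics of a negated match can be sketched in Python as follows. The sketch treats a document that lacks the field entirely as "not equal", which is an assumption of this example and should be checked against the GDN documentation. The field names and sample records are hypothetical.

```python
def search_not_equal(docs, field, value):
    """Return the documents whose field differs from `value`.

    Assumption: a document without the field counts as not equal.
    """
    return [doc for doc in docs if doc.get(field) != value]


# Hypothetical sample records, not rows from the actual dataset.
reviews = [
    {"Property_Name": "Rhodes Hotel", "Review_Rating": 5},
    {"Property_Name": "The Savoy", "Review_Rating": 4},
    {"Review_Rating": 2},  # no hotel name recorded
]
others = search_not_equal(reviews, "Property_Name", "Rhodes Hotel")
print(len(others))  # 2
```

The matching and non-matching sets together always cover the whole collection, so a negated search can return very large results on a dataset of this size.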
Matching multiple strings
Exact value matching can also consider several values at once. There are three ways to do this: using the logical OR operator, using the IN operator, or using bind parameters.
Listing 7: Matching multiple strings using the logical OR condition
The same query can be specified using the IN operator as shown in Listing 8.
Listing 8: Matching multiple strings using the IN operator
The third approach specifies a bind parameter as shown in Listing 9.
Listing 9: Bind parameter definition for multiple strings matching
Once the bind parameter has been specified the multiple strings matching query can be written as shown in Listing 10.
Listing 10: Matching multiple strings using the bind parameters
All three approaches return the same list of items (1860 in total)¹.
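The equivalence of the three approaches can be checked with a small Python model. The hotel names, the field name `Property_Name`, and the bind parameter name `hotel_names` are assumptions chosen for illustration and are not necessarily the ones used in Listings 7 to 10.

```python
def match_or(doc):
    """Two equality checks joined by a logical OR."""
    name = doc.get("Property_Name")
    return name == "Rhodes Hotel" or name == "The Savoy"


def match_in(doc):
    """Membership in a literal list, like an IN operator."""
    return doc.get("Property_Name") in ["Rhodes Hotel", "The Savoy"]


def match_bound(doc, bind_vars):
    """Membership in a list supplied separately from the query logic."""
    return doc.get("Property_Name") in bind_vars["hotel_names"]


# Hypothetical sample records, not rows from the actual dataset.
reviews = [
    {"Property_Name": "Rhodes Hotel"},
    {"Property_Name": "The Savoy"},
    {"Property_Name": "Some Other Hotel"},
]
bind_vars = {"hotel_names": ["Rhodes Hotel", "The Savoy"]}

by_or = [d for d in reviews if match_or(d)]
by_in = [d for d in reviews if match_in(d)]
by_bind = [d for d in reviews if match_bound(d, bind_vars)]
print(by_or == by_in == by_bind)  # True
```

The bind parameter version keeps the list of values outside the query logic, so the same query can be reused with a different list without editing it.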
Now that you are familiar with the Macrometa GDN search architecture, you can get started on exact value matching based on this example. View the second blog in our search patterns series to learn about prefix matching, full-text token search, and phrase and proximity search.
¹The GUI displays up to 1000 records. Complete results can be found via Macrometa GDN's REST API.