
We live in a world where the way people do business is being transformed into digital form. This phenomenon, called digital transformation, is propelling enormous amounts of data into digital repositories at unprecedented rates. The data comes in various shapes, sizes, qualities, and values. Indexing and search capabilities can contribute to an organization’s success by letting it quickly and accurately locate specific information in large data repositories.

Search is often underutilized by application developers. Typically, developers use the term "search" to mean searching for a string, but the advanced search capabilities within the Macrometa Global Data Network (GDN) enable far more sophisticated searches. One of the goals of this blog series is to help application developers use the full set of benefits provided by GDN Search.

This is the first blog in a three-part series about seven different search patterns frequently found in the Macrometa GDN. In it, we introduce Macrometa Search capabilities and go over exact value matching. The second blog will cover prefix matching, full-text token search, and phrase and proximity search. The series will conclude with range queries, faceted search, and geospatial search. Let’s begin with a detailed classification and then walk through the first example.

Powering search with the Macrometa GDN

The Macrometa GDN is a multi-model, streaming NoSQL database with integrated pub/sub and stream data processing capabilities. When it comes to search, GDN brings two main advantages over other data stores.

First, the GDN is equipped with a state-of-the-art indexing and search facility that lets users perform sophisticated search operations on multi-model data storage, including key-value pairs, documents, and graphs. Users are not required to reformat the data and upload it to a separate system. Instead, the data persisted on the GDN can be used directly for search. The complexity of the search can vary from simple queries, such as exact value matching and range queries, up to complex operations such as faceted search and geospatial search. Each of these categories can be classified as a search pattern.

Second, with the Macrometa GDN the search indexes are updated globally whenever the underlying data is updated. This allows the GDN to answer search queries quickly and with high accuracy.

Exploring the dataset sample for search patterns

The example in this blog uses a dataset of reviews of London-based hotels, obtained from Kaggle. The refined dataset has 10,000 reviews collected by crawling a leading travel portal. Each data item in the dataset follows the schema shown in Figure 1.

Figure 1: Schema of a hotel review

The dataset can be downloaded from this URL. Before importing it, create a fabric named Hotels in your GDN federation and then create a document collection called hotel_reviews within that fabric. After downloading the JSON file, replace <DATA> in Listing 1 with the content of the file and run the curl command in a terminal to import the dataset into your GDN federation.



curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
  "data": <DATA>,
  "details": false,
  "primaryKey": "",
  "replace": false
}'

Listing 1: How to copy the content from the JSON file into the CURL command
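
Pasting 10,000 reviews into a terminal by hand is impractical, so the substitution of <DATA> can also be scripted. The following is a minimal sketch, assuming the downloaded file contains a JSON array and is named hotel_reviews.json (a hypothetical name). The helper names build_import_payload and import_reviews are illustrative only, and HOST and BEARER_TOKEN stand for the same placeholders used in Listing 1.

```shell
#!/bin/sh
# Wrap the contents of a JSON file (expected to hold a JSON array of reviews)
# in the import request body used in Listing 1.
build_import_payload() {
  printf '{\n  "data": '
  cat "$1"
  printf ',\n  "details": false,\n  "primaryKey": "",\n  "replace": false\n}\n'
}

# Send the generated body to the import endpoint from Listing 1.
# HOST and BEARER_TOKEN are assumed to be set in the environment.
import_reviews() {
  build_import_payload "$1" | curl --location --request POST \
    "https://api-${HOST}/_fabric/Hotels/_api/import/hotel_reviews" \
    --header 'accept: application/json' \
    --header "Authorization: ${BEARER_TOKEN}" \
    --header 'Content-Type: text/plain' \
    --data-binary @-
}

# Usage (hypothetical file name):
#   HOST=... BEARER_TOKEN=... import_reviews hotel_reviews.json
```

Reading the body from standard input with --data-binary @- avoids shell quoting problems with apostrophes inside the review text.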



curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
  "data": [{
    "Property Name": "The Savoy",
    "Review Rating": 5,
    "Review Title": "a legend",
    "Review Text": "We stayed in May during a short family vacation. Location is perfect to explore all the London sights. Service and facilities are impeccable. The hotel staff was very nicely taking care of our kids. We'\''ll be back for sure!",
    "Location Of The Reviewer": "Oslo, Norway",
    "Date Of Review": "6\/28\/2018"
}],
  "details": false,
  "primaryKey": "",
  "replace": false
}'

Listing 2: Importing a single data item into the GDN federation by invoking the REST API of a GDN node

Listing 2 imports only a single review, for the hotel named The Savoy, by specifying its JSON content directly. The placeholders <HOST> and <BEARER_TOKEN> refer to the host name of the GDN node and the bearer token, which can be copied from the REST API of the GDN node.

Understanding GDN Search terminology

The overall architecture of GDN Search can be summarized as follows (see Figure 2). Search is based on the concept of views, which can be regarded as virtual collections. A view consists of an inverted index that provides fast full-text searching over one or multiple linked collections. The view also stores the configuration for the search capabilities, such as the attributes to index. A view can cover multiple, or even all, attributes of the documents in the linked collections. Views can be created and maintained via the web user interface, the HTTP REST API, or the JavaScript API of GDN. All the examples in this blog series use the first two techniques.

Views can be queried using C8QL via its SEARCH operation. Search expressions are composed of the fields to search, the search terms, comparison and logical operators, as well as other GDN Search functions.

Figure 2: Overall architecture of GDN Search

The input values are processed by constructs called Analyzers. These can normalize strings and tokenize text into words, which enables various possibilities to search for values later on.
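
To make the role of Analyzers more concrete, here is a rough shell approximation of the difference between leaving a value untouched, as the identity Analyzer does, and normalizing and tokenizing it. This is only an illustration under simplifying assumptions: the function names are hypothetical, and real text Analyzers do considerably more than lowercasing and splitting on spaces.

```shell
#!/bin/sh
# Identity-style processing: the whole input is kept as one unmodified token.
identity_tokens() {
  printf '%s\n' "$1"
}

# Simplified text-style processing: lowercase the input, then emit one
# token per space-separated word.
tokenized_tokens() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n'
}

identity_tokens "Rhodes Hotel"    # prints one line: Rhodes Hotel
tokenized_tokens "Rhodes Hotel"   # prints two lines: rhodes, hotel
```

With the first form, only the complete string "Rhodes Hotel" matches, which is the behavior exact value matching relies on.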

The search output can be sorted either by similarity ranking, which returns the best matches first using popular scoring algorithms (TF-IDF, Okapi BM25), or by user-defined relevance boosting and dynamic score calculation.

Implementing exact value matching

In its most basic form, a search pattern matches the presence of an exact value. The exact value can be a string, a number, a number range, or a boolean. Here we index and search strings using the identity Analyzer. The view used for exact value matching can be defined as shown in Listing 3.



curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/search/view' \
--header 'accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--data-raw '{
  "name": "sample1_view1",
  "primarySort": [],
  "links": {
    "hotel_reviews": {
      "analyzers": [
        "identity"
      ],
      "fields": {
         "Property_Name": {}
      },
      "includeAllFields": true,
      "storeValues": "none",
      "trackListPositions": true
    }
  },
  "type": "search"
}'

Listing 3: Defining a view for exact string matching

After defining the search view, we can get the list of reviews for the Rhodes Hotel as shown in Listing 4.



FOR review IN sample1_view1
  SEARCH ANALYZER(review.Property_Name == "Rhodes Hotel", "identity")
  RETURN review.Property_Name

Listing 4: Match exact hotel name

Note that since identity is the default Analyzer, it is not necessary to set the Analyzer context with the ANALYZER() function here. Exactly the same results can therefore be obtained by executing the query shown in Listing 5.



FOR review IN sample1_view1
  SEARCH review.Property_Name == "Rhodes Hotel"
  RETURN review.Property_Name

Listing 5: Match exact hotel name using the default identity Analyzer

Matching with negations

You can use negation to search for items that do not exactly match the specified criteria. In this scenario, inequality is checked with the != operator, which returns everything from the view index except the documents that match the specified value.



FOR review IN sample1_view1
  SEARCH ANALYZER(review.Property_Name != "Rhodes Hotel", "identity")
  RETURN review.Property_Name

Listing 6: Match records that do not have the specified property value

Matching multiple strings

Exact value matching can also be performed against several values at once. There are three approaches to do this: using the logical OR operator, using the IN operator, or using bind parameters.



FOR review IN sample1_view1
  SEARCH ANALYZER(review.Property_Name == "Apex London Wall Hotel" OR review.Property_Name == "Corinthia Hotel London", "identity")
  RETURN review.Property_Name

Listing 7: Matching multiple strings using the logical OR condition

The same query can be specified using the IN operator as shown in Listing 8.



FOR review IN sample1_view1
  SEARCH ANALYZER(review.Property_Name IN ["Apex London Wall Hotel", "Corinthia Hotel London"], "identity")
  RETURN review.Property_Name

Listing 8: Matching multiple strings using the IN operator

The third approach specifies a bind parameter as shown in Listing 9.



{
  "hotel_names": [
    "Apex London Wall Hotel",
    "Corinthia Hotel London"
  ]
}

Listing 9: Bind parameter definition for multiple strings matching

Once the bind parameter has been specified the multiple strings matching query can be written as shown in Listing 10. 



FOR review IN sample1_view1
  SEARCH ANALYZER(review.Property_Name IN @hotel_names, "identity")
  RETURN review.Property_Name

Listing 10: Matching multiple strings using the bind parameters
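
To run this query outside the GUI, the query and its bind parameter can be sent together over REST. The sketch below assumes the GDN exposes a cursor endpoint at /_fabric/Hotels/_api/cursor that accepts a JSON body with query and bindVars fields; verify the path and field names against the REST API reference of your GDN node. The helper names build_query_payload and run_query are hypothetical, and HOST and BEARER_TOKEN stand for the same placeholders used in the earlier listings.

```shell
#!/bin/sh
# Build the request body: the C8QL query from Listing 10 plus the bind
# parameter from Listing 9, placed under "bindVars" (assumed field name).
build_query_payload() {
  cat <<'EOF'
{
  "query": "FOR review IN sample1_view1 SEARCH ANALYZER(review.Property_Name IN @hotel_names, \"identity\") RETURN review.Property_Name",
  "bindVars": {
    "hotel_names": [
      "Apex London Wall Hotel",
      "Corinthia Hotel London"
    ]
  }
}
EOF
}

# Post the body to the assumed cursor endpoint.
# HOST and BEARER_TOKEN are assumed to be set in the environment.
run_query() {
  build_query_payload | curl --location --request POST \
    "https://api-${HOST}/_fabric/Hotels/_api/cursor" \
    --header 'accept: application/json' \
    --header 'Content-Type: application/json' \
    --header "Authorization: ${BEARER_TOKEN}" \
    --data-binary @-
}

# Usage:
#   HOST=... BEARER_TOKEN=... run_query
```

Keeping the hotel names in the bind parameter rather than in the query string means the list can change without editing the query itself.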

All three approaches return the same list of items (1860 items in total)¹.

Now that you are familiar with the Macrometa GDN search architecture, you can get started on exact value matching based on this example. View the second blog in our search patterns series to learn about prefix matching, full-text token search, and phrase and proximity search.

¹The GUI displays up to 1000 records. Complete results can be found via Macrometa GDN's REST API.

Photo by Sam 🐷 on Unsplash

Posted Jun 24, 2022 in Tutorial category
