Tutorial

Beyond Text Search: Exploring Advanced Search Capabilities With Macrometa

Miyuru Dayarathna |

Jul 05, 2022

Some situations call for more complex search patterns. In this blog, we will cover range queries, faceted search, and geospatial search. Thank you for joining us again for the last blog in our search patterns series. If you haven’t seen the first blog yet, please take a look at the basic search capabilities of the Macrometa Global Data Network (GDN) and exact value matching examples. And if you are interested in prefix matching, full-text token, or phrase and proximity search, you may want to view the second blog. We have covered seven different search patterns in the series to help application developers realize the full benefits of Macrometa search.

Setting up the dataset

If you already downloaded the dataset from the first or second blog, you can skip this step. The dataset can be downloaded from this URL. After downloading the JSON file, replace <DATA> in Listing 1 with the content of the JSON file in the following command. The dataset can be imported to your GDN federation issuing the CURL command on a terminal as shown in the Listing 1. Before executing the following CURL command you need to first create a fabric named Hotels in your GDN federation and then create a document collection called hotel_reviews within that fabric.

curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
  "data": <DATA>,
  "details": false,
  "primaryKey": "",
  "replace": false
}'

Listing 1: How to copy the content from the JSON file into the CURL command

curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/import/hotel_reviews' \
--header 'accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Content-Type: text/plain' \
--data-raw '{
  "data": [{
    "Property Name": "The Savoy",
    "Review Rating": 5,
    "Review Title": "a legend",
    "Review Text": "We stayed in May during a short family vacation. Location is perfect to explore all the London sights. Service and facilities are impeccable. The hotel staff was very nicely taking care of our kids. We'll be back for sure!",
    "Location Of The Reviewer": "Oslo, Norway",
    "Date Of Review": "6\/28\/2018"
}],
  "details": false,
  "primaryKey": "",
  "replace": false
}'

Listing 2: Importing a single data item into the GDN federation via invoking the REST API of a GDN node

In the above example in Listing 2 we have imported only a single review made for the hotel named The Savoy by specifying its JSON content. The values <HOST> and <BEARER_TOKEN> refers to the host name of the GDN node and the bearer token can be copied by referring to the REST API of the GDN node.

Implementing range queries

Range queries allow for searching data that are above, below, or between a minimum and a maximum value. The main use case for range queries is to search numeric values in documents.

Range queries can be specified comparing to a number, comparing to a numeric range, as well as comparing between strings.

When developing range queries in GDN you don't need to specify any Analyzers. This is because range queries deal with numeric values and those are not processed by Analyzers. Hence, you first need to remove the identity analyzer as shown in the exact value matching example in the first blog. This can be achieved via the following CURL command.

curl --location --request PUT 'https://api-<HOST>/_fabric/Hotels/_api/search/view/sample1_view1/properties' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--data-raw '{
   "links": {
       "hotel_reviews": {
           "analyzers": [],
           "fields": {
            "Property_Name": {}
           },
           "includeAllFields": true,
           "storeValues": "none",
           "trackListPositions": true
       }
   }
}
'

Listing 3: Removing the Analyzers defined for hotel_reviews collection

Comparing to a number

Let's take the scenario where you want to select all the hotel reviews which have a rating of 5. This can be accomplished using the following query.

FOR review IN sample1_view1
  SEARCH review.`Review Rating` == 5
  RETURN {
    Property_Name: review.Property_Name,
    `Review Rating`: review.`Review Rating`
    }

Listing 4: Find hotel reviews with rating 5

When executed the above query results in the following list of items (6764 in total).¹

The range query can be executed considering a set of numeric items. For example, the following query finds all the reviews which had ratings 3, 4, or 5.

FOR review IN sample1_view1
  SEARCH review.`Review Rating` IN [3, 4, 5]
  RETURN {
    Property_Name: review.Property_Name,
    `Review Rating`: review.`Review Rating`
    }

Listing 5: Find hotel reviews with ratings 3, 4, or 5

Since the hotel review rating values specified in the data set are in the range 1...5, the same query can be specified using the > symbol as follows.

FOR review IN sample1_view1
  SEARCH review.`Review Rating` > 2
  RETURN {
    Property_Name: review.Property_Name,
    `Review Rating`: review.`Review Rating`
    }

Listing 6: Another approach to find hotel reviews with ratings 3, 4, or 5

Each of the above queries should result in the same number of 9506 records.¹

Comparing to a numeric range

Rather than specifying each and every item in a continuous numeric range the same query can be specified using the range operator. For example, the query in Listing 5 can be rewritten as follows below.

FOR review IN sample1_view1
  SEARCH review.Review_Rating IN 3..5
  RETURN {
    Property_Name: review.Property_Name,
    Review_Rating: review.Review_Rating
    }

Listing 7: Use of range operator to find hotel reviews with ratings in the range 3 ... 5

The IN_RANGE() function allows for specifying a more advanced version of the query shown in Listing 7 by allowing you to specify the boundary conditions. When executed this should result in 9506 records.¹

FOR review IN sample1_view1
  SEARCH IN_RANGE(review.`Review Rating`, 3, 5, true, true)
  RETURN {
    Property_Name: review.Property_Name,
    `Review Rating`: review.`Review Rating`
    }

Listing 8: Use of IN_RANGE() function to find hotel reviews with ratings in the range 3 ... 5

Range search queries can be further augmented by using the standard comparison operators to search for values below and above a range. For example, you could specify a range query to collect all the reviews having ratings less than or equal to 2, greater than 4, and not equal to 1 as follows below.

FOR review IN sample1_view1
  SEARCH (review.`Review Rating` <= 2 OR review.`Review Rating` > 4) AND review.`Review Rating` != 1
  SORT review.`Review Rating`
  RETURN {
    Property_Name: review.Property_Name,
    `Review Rating`: review.`Review Rating`
    }

Listing 9: A detailed range search query specified using the standard comparison operators

When executed the above query should result in 7035 records.¹

Comparing strings

The examples in the previous subsection were purely based on numeric values. However, range comparisons can be made on strings using the standard comparison operators as well as the IN_RANGE() function. Before running such string comparisons, identity search Analyzer has to be defined by invoking the CURL command shown in Listing 2 in the first blog in the series. For example, the following query selects all the hotel names which start with Apex until (exclusive of) hotel names which start with the letter D.

FOR review IN sample1_view1
  SEARCH ANALYZER(IN_RANGE(review.Property_Name, "Apex", "D", true, false), "identity")
  SORT review.`Review Rating`
  RETURN review.Property_Name

Listing 10: Range search query which involves comparing strings

The execution of the above query results in a list of items (2037 items in total)¹ as follows.

Implementing faceted search

Faceted search allows for combining aggregation with search queries to retrieve how frequently values occur overall. We need to first define a view using the identity analyzer as follows below.

curl --location --request POST 'https://api-<HOST>/_fabric/Hotels/_api/search/view' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: <BEARER_TOKEN>' \
--data-raw '{
    "name": "sample1_view9",
  "links": {
    "hotel_reviews": {
                    "analyzers": [],
      "fields": {
        "Review Text": {
          "analyzers": [
            "identity"
          ]
        }
      }
    }
  },
  "type": "search"
}
'

Listing 11: Defining search view for faceted search

A number of reviews made for each and every hotel can be calculated using the faceted search query as follows below.

FOR review IN sample1_view9
  COLLECT name = review.Property_Name WITH COUNT INTO count
  RETURN { name, count}

Listing 12: Get all the hotels and number of reviews posted for each hotel

This should indicate the dataset has reviews on 20 hotels. The first few results are listed below.

The following query can be used to look up how many times a review carries the title "Very good". Note that here the case of the term "Very good" is exactly matched. Hence, although the title “very good” in all lowercase appears five times across this hotel review data set, only three reviews are listed as the result for the below query.

FOR review IN sample1_view1
      SEARCH ANALYZER(review.`Review Title` == "Very good", "identity")
COLLECT WITH COUNT INTO count
  RETURN count

Listing 13: Find how many times a review carries the title "Very good"

The above query should result in 3 records. For a query like above which is a simple single condition, there is an optimization that could accurately determine the count from index data faster than the standard COLLECT as follows below.

FOR review IN sample1_view1
      SEARCH ANALYZER(review.`Review Title` == "Very good", "identity")
      OPTIONS { countApproximate: "cost" }
COLLECT WITH COUNT INTO count
  RETURN count

Listing 14: Search query optimization which enables accurately determining the count from the index data faster than the standard COLLECT

If you need all the five occurrences of the title "very good" you can write a query as follows below.

Let alternatives = ["Very good", "Very Good", "very good"]
FOR alternative in alternatives
LET count = FIRST(
FOR review IN sample1_view1
      SEARCH ANALYZER(review.`Review Title` == alternative, "identity")
      OPTIONS { countApproximate: "cost" }
COLLECT WITH COUNT INTO count
  RETURN count
  ) RETURN {alternative, count}

Listing 15: Search query which lists the occurrence of the title "very good" in different forms

Execution of the above query should result in three records as follows below.

Exploring geospatial search

Geospatial search is a less frequently found but very useful feature when it comes to implementing geography-related information processing in general data analytics applications. Traditionally a special class of information systems called Geographic Information Systems (GIS) have been used to deal with spatial data extensively. However, NoSQL data stores such as Macrometa GDN have the capability of storing and searching on spatial data which has become very useful when developing applications where geospatial search is required. Macrometa GDN supports geospatial queries such as finding coordinates and shapes within a radius or an area.

Geospatial datasets

In order to try out geospatial search capabilities of Macrometa GDN we have selected a dataset from the Seattle metropolis in the USA. The dataset corresponds to public schools in Seattle. This dataset was obtained from the DATA.GOV website.

curl --location --request POST 'https://api-<HOST>/_fabric/SeattleSchools/_api/import/schools' \
--header 'Authorization: <BEARER_TOKEN>' \
--header 'Accept: application/json' \
--header 'Content-Type: text/plain' \
--data-raw '{
  "data": [
{ "type": "Feature", "properties": { "OBJECTID": 1, "TYPE": "NonStandard", "SCHOOL": "QUEEN ANNE GYM", "ADDRESS": "1431 2nd Ave N", "SE_ANNO_CAD_DATA": "null", "NAME": "Queen Anne Gym", "GRADE": "9-12", "CITY": "Seattle", "ZIP": "98109", "PHONE": "null", "WEBSITE": "null", "XCOORD": 1265680.67393531, "YCOORD": 234243.29115321999, "SITE_USE": "Active", "PRJ_ENRLLMNT": "null" }, "geometry": { "type": "Point", "coordinates": [ -122.353265218350501, 47.632022747314181 ] } }
],
  "details": false,
  "primaryKey": "",
  "replace": false
}'

Listing 16: Importing the sample dataset of Seattle's schools into GDN federation via invoking the REST API of a GDN node

Note that Listing 16 shows insertion of only one data item using the CURL command for the illustration purposes. However, you need to replace [] of "data" element with the content from the schools file before running the following sample queries. Furthermore, <HOST> and <BEARER_TOKEN> values have to be replaced similar to the previous examples.

Next, we will execute a query to identify all the schools which are located within 1000 meters of Saint George Church. This can be specified as follows below. Here geo coordinates -122.31551191249362 and 47.55458207164884 correspond to the geo location of the Saint George Church.

LET loc = GEO_POINT(-122.31551191249362, 47.55458207164884)
FOR x IN schools
  FILTER GEO_DISTANCE(loc, x.geometry) <= 1000
  RETURN x.properties.NAME

Listing 17: Get all the schools which are located within 1km distance from the Saint George Church

This should result in two schools named Cleveland STEM and Maple. These two schools along with the Saint George Church can be visualized on a map as shown below.

Visualization of geospatial query results

Conclusion

The ability to specify sophisticated search queries is very important for any data management system. It is because many of the search scenarios we encounter today demand the ability for conducting search queries at different levels of complexity and involve various types of datasets.

The first blog explained how the search capabilities of the GDN have been implemented and showed an example of how to execute exact value matching. The second blog provided a step-by-step guide on how to set up prefix matching, full-text token, and phrase and proximity search on the Macrometa GDN using several real-world data sets. This last blog provided details on range queries, faceted search, and geospatial search.

The search pattern examples also listed the details of the expected results so that you can verify the results when you try these queries in the Macrometa GDN. View our search or Macrometa GDN documentation for more information, or contact us to talk to an expert.

¹The GUI displays up to 1000 records. Complete results can be found via Macrometa GDN's REST API.

Photo by José Martín Ramírez Carrasco on Unsplash