Analyzers

Analyzers enable you to break search inputs into sets of sub-values that search views can use for improved searching and sorting. When you use an analyzer, the search view gathers the attributes of all documents in linked collections and creates the appropriate sub-values and metadata.

You can use the TOKENS() function to split a phrase into token strings for use in C8QL search queries.

An analyzer processes values according to the features enabled for it, such as frequency, norm, and position.
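To make the idea of features more concrete, here is a minimal Python sketch of the kind of metadata that the frequency, norm, and position features could record for one analyzed field. It is an illustration only, not Macrometa's implementation: the function name `index_features` is hypothetical, and the norm formula (1 divided by the square root of the token count) is an assumed, commonly used definition that the documentation does not confirm.

```python
import math
from collections import Counter


def index_features(tokens):
    """Compute illustrative per-field metadata for a list of tokens.

    Hypothetical helper for illustration; not part of any Macrometa API.
    """
    # frequency: how many times each token occurs in the field
    frequency = dict(Counter(tokens))

    # position: the offsets at which each token occurs,
    # which is what phrase and proximity matching would rely on
    position = {}
    for offset, token in enumerate(tokens):
        position.setdefault(token, []).append(offset)

    # norm: a length-normalization factor. 1/sqrt(token count) is an
    # ASSUMED definition, chosen because it is common in search engines.
    norm = 1.0 / math.sqrt(len(tokens)) if tokens else 0.0

    return {"frequency": frequency, "position": position, "norm": norm}


features = index_features(["quick", "brown", "fox", "quick"])
print(features)
```

In this sketch, an analyzer without the position feature would simply skip the `position` map, which is why phrase-style matching would not be possible on its output.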

Built-in Analyzers

Macrometa provides a set of built-in analyzers.

The identity analyzer uses the frequency and norm features. All text analyzers tokenize strings with stemming enabled, no stop-words configured, case conversion set to lower, and accent mark removal enabled. The text analyzers use the frequency, norm, and position features.

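| Name | Type | Language |
| --- | --- | --- |
| identity | identity | none |
| text_de | text | German |
| text_en | text | English |
| text_es | text | Spanish |
| text_fi | text | Finnish |
| text_fr | text | French |
| text_it | text | Italian |
| text_nl | text | Dutch |
| text_no | text | Norwegian |
| text_pt | text | Portuguese |
| text_ru | text | Russian |
| text_sv | text | Swedish |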
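The following Python sketch approximates what the built-in text analyzers do to a string, so the settings described above are easier to picture. It covers lower-casing, accent removal, and word splitting only. Stemming is deliberately left out, and a simple regular expression stands in for ICU word breaking, so real analyzer output can differ. The function names `analyze_text` and `analyze_identity` are hypothetical, and the identity behavior shown (the value is kept as a single unmodified token) is an assumption based on the analyzer's name and type.

```python
import re
import unicodedata


def analyze_text(value):
    """Approximate a text analyzer: lower-case, strip accents, split into words.

    Illustration only: no stemming, and \\w+ stands in for ICU tokenization.
    """
    # Case conversion set to lower
    lowered = value.lower()

    # Accent removal: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Tokenize on runs of word characters; no stop-words are removed
    return re.findall(r"\w+", stripped)


def analyze_identity(value):
    """Assumed identity behavior: the whole value is one unmodified token."""
    return [value]


print(analyze_text("Cr\u00e8me Br\u00fbl\u00e9e FOR Two"))
print(analyze_identity("Cr\u00e8me Br\u00fbl\u00e9e"))
```

Note that "for" survives in the token list because no stop-words are configured. With stemming enabled, as in the built-in analyzers, some tokens would be reduced further to their stems.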

Supported Languages

Analyzers rely on ICU for language-dependent tokenization and normalization. GDN ships with a data file, icudtl.dat, which contains information for supported languages.

C8DB only supports UTF-8 encoding.

Search views do not support alphabetical ordering in different languages. For example, a range query performed against a search view will not follow language rules defined in the analyzer locale.
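The practical effect of this limitation can be sketched in Python, whose string comparison uses Unicode code points (the same order as UTF-8 bytes). This assumes that a search view falls back to a similar binary comparison, which the documentation does not state explicitly. The word list and the `in_range` helper are made up for illustration.

```python
# "\u00e4pfel" is "äpfel"; the escape keeps the example independent of file encoding.
words = ["banane", "\u00e4pfel", "apfel"]

# Code-point order puts the umlaut word last, after every plain ASCII word,
# whereas German dictionary order would place it next to "apfel".
code_point_order = sorted(words)


def in_range(value, low, high):
    """Inclusive range check using plain code-point comparison."""
    return low <= value <= high


# A range from "a" to "c" misses the umlaut word, because "\u00e4" (U+00E4)
# compares greater than any ASCII letter.
in_range_a_to_c = [w for w in code_point_order if in_range(w, "a", "c")]

print(code_point_order)
print(in_range_a_to_c)
```

If language-aware ordering matters for your results, one option is to sort or filter the returned documents in application code with a collation library instead of relying on the range query alone.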

Snowball provides stemming capabilities and supports the following languages:

| Code | Language |
| --- | --- |
| de | German |
| en | English |
| es | Spanish |
| fi | Finnish |
| fr | French |
| it | Italian |
| nl | Dutch |
| no | Norwegian |
| pt | Portuguese |
| ru | Russian |
| sv | Swedish |