Analyzers

Nrtsearch provides some default analyzers, lets you specify any of the analyzers available in Lucene, and supports building custom analyzers.

Proto definition for Analyzer:

message Analyzer {
    oneof AnalyzerType {
        string predefined = 1; // Analyzers predefined in Lucene. Apart from standard and classic, there are en.English, bn.Bengali, eu.Basque, etc. (names derived from Lucene's analyzer class names)
        CustomAnalyzer custom = 2;
    }
}

To use a predefined analyzer, provide its name in the predefined field. To create a custom analyzer, provide its definition in the custom field.
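
As a minimal sketch, assuming the Analyzer message is supplied as JSON (as in the normalizer example at the end of this page), selecting the predefined English analyzer takes a single field:

```json
{
    "predefined": "en.English"
}
```

Because predefined and custom belong to the same oneof, only one of them can be set on a given Analyzer.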

Predefined Analyzers

The following predefined analyzers are available in Nrtsearch. They are derived from the analyzer classes available in Lucene.

  • ar.Arabic

  • hy.Armenian

  • eu.Basque

  • bn.Bengali

  • br.Brazilian

  • bg.Bulgarian

  • ca.Catalan

  • cjk.CJK

  • standard.Classic

  • cz.Czech

  • da.Danish

  • nl.Dutch

  • en.English

  • et.Estonian

  • fi.Finnish

  • fr.French

  • gl.Galician

  • de.German

  • el.Greek

  • hi.Hindi

  • hu.Hungarian

  • id.Indonesian

  • ga.Irish

  • it.Italian

  • lv.Latvian

  • lt.Lithuanian

  • no.Norwegian

  • fa.Persian

  • pt.Portuguese

  • ro.Romanian

  • ru.Russian

  • ckb.Sorani

  • es.Spanish

  • core.Stop

  • sv.Swedish

  • th.Thai

  • tr.Turkish

  • standard.UAX29URLEmail

  • core.UnicodeWhitespace

  • core.Whitespace

  • query.QueryAutoStopWord

  • miscellaneous.LimitTokenCount

Building Custom Analyzers

This is the proto definition for a CustomAnalyzer:

message NameAndParams {
    string name = 1;
    map<string, string> params = 2;
}

message ConditionalTokenFilter {
    NameAndParams condition = 1;
    repeated NameAndParams tokenFilters = 2;
}

// Used to be able to check if a value was set
message IntObject {
    int32 int = 1;
}

message CustomAnalyzer {
    repeated NameAndParams charFilters = 1; // Available char filters as of Lucene 8.2.0: htmlstrip, mapping, persian, patternreplace
    NameAndParams tokenizer = 2; // Specify a Lucene tokenizer (https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/analysis/Tokenizer.html). Possible options as of Lucene 8.2.0: keyword, letter, whitespace, edgengram, pathhierarchy, pattern, simplepatternsplit, classic, standard, uax29urlemail, thai, wikipedia.
    repeated NameAndParams tokenFilters = 3; // Specify a Lucene token filter (https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/analysis/TokenFilter.html). The possible options can be seen at https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html or by calling TokenFilterFactory.availableTokenFilters().
    repeated ConditionalTokenFilter conditionalTokenFilters = 4; // TODO: this is not properly supported yet, the only impl requires a protected terms file. Can support this properly later if needed
    string defaultMatchVersion = 5; // Lucene version as LUCENE_X_Y_Z or X.Y.Z, LATEST by default
    IntObject positionIncrementGap = 6; // Must be >= 0
    IntObject offsetGap = 7; // Must be >= 0
}
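
The IntObject wrapper means that positionIncrementGap and offsetGap are written as nested objects rather than bare integers. The following sketch shows how the optional fields might look in JSON; the tokenizer, filter, and all values are illustrative assumptions, not recommended settings:

```json
{
    "custom": {
        "tokenizer": {"name": "standard"},
        "tokenFilters": [
            {"name": "lowercase"}
        ],
        "defaultMatchVersion": "8.2.0",
        "positionIncrementGap": {"int": 100},
        "offsetGap": {"int": 1}
    }
}
```

Leaving out positionIncrementGap or offsetGap entirely leaves the value unset, which is the case the wrapper exists to detect.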

A custom analyzer is created by combining character filters, a tokenizer, and token filters. Each of these performs a different function during analysis:

  • Character filter: changes or removes characters from the input text. Executed first during analysis.

  • Tokenizer: consumes a stream of characters and outputs tokens. Executed after all character filters.

  • Token filter: changes or removes tokens from the token stream. Executed after the tokenizer.

The API also accepts a map<string, string> params for each character filter, tokenizer, or token filter, which can be used to override its default parameters.
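
The three stages and the params map can be sketched together as one custom analyzer. This example strips HTML markup, tokenizes with the standard tokenizer, lowercases, and then folds accented characters to ASCII. The preserveOriginal parameter is taken from Lucene's ASCIIFoldingFilterFactory; the particular combination of components is an illustrative assumption:

```json
{
    "custom": {
        "charFilters": [
            {"name": "htmlstrip"}
        ],
        "tokenizer": {"name": "standard"},
        "tokenFilters": [
            {"name": "lowercase"},
            {"name": "asciiFolding", "params": {"preserveOriginal": "true"}}
        ]
    }
}
```

All parameter values are strings, since params is a map<string, string>, so booleans and numbers are quoted.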

Available character filters:

  • htmlstrip

  • mapping

  • persian

  • patternreplace

  • mappingV2 - Similar to the mapping filter, except rules are specified directly in the parameters. See MappingV2CharFilterFactory.
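
As a hedged example of a character filter with parameters, the sketch below uses patternreplace to delete hyphens before tokenization, so that text such as "wi-fi" reaches the tokenizer as "wifi". The pattern and replacement parameter names come from Lucene's PatternReplaceCharFilterFactory; the whitespace tokenizer is an arbitrary choice for illustration:

```json
{
    "custom": {
        "charFilters": [
            {"name": "patternreplace", "params": {"pattern": "-", "replacement": ""}}
        ],
        "tokenizer": {"name": "whitespace"}
    }
}
```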

Available tokenizers:

  • keyword

  • letter

  • whitespace

  • edgengram

  • ngram

  • pathhierarchy

  • pattern

  • simplepatternsplit

  • simplepattern

  • classic

  • standard

  • uax29urlemail

  • thai

  • wikipedia
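
Tokenizers accept parameters in the same way. The following sketch configures the edgengram tokenizer to emit prefixes of 2 to 5 characters, a common setup for prefix matching. The minGramSize and maxGramSize names are those of Lucene's EdgeNGramTokenizerFactory, and the sizes are illustrative:

```json
{
    "custom": {
        "tokenizer": {
            "name": "edgengram",
            "params": {"minGramSize": "2", "maxGramSize": "5"}
        }
    }
}
```

With these settings, the input "search" would be expected to produce the tokens "se", "sea", "sear", and "searc".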

Available token filters:

  • suggestStop

  • apostrophe

  • arabicNormalization

  • arabicStem

  • bulgarianStem

  • bengaliNormalization

  • bengaliStem

  • brazilianStem

  • cjkBigram

  • cjkWidth

  • soraniNormalization

  • soraniStem

  • commonGrams

  • commonGramsQuery

  • dictionaryCompoundWord

  • hyphenationCompoundWord

  • decimalDigit

  • lowercase

  • stop

  • type

  • uppercase

  • czechStem

  • germanLightStem

  • germanMinimalStem

  • germanNormalization

  • germanStem

  • greekLowercase

  • greekStem

  • englishMinimalStem

  • englishPossessive

  • kStem

  • porterStem

  • spanishLightStem

  • spanishMinimalStem

  • persianNormalization

  • finnishLightStem

  • frenchLightStem

  • frenchMinimalStem

  • irishLowercase

  • galicianMinimalStem

  • galicianStem

  • hindiNormalization

  • hindiStem

  • hungarianLightStem

  • hunspellStem

  • indonesianStem

  • indicNormalization

  • italianLightStem

  • latvianStem

  • minHash

  • asciiFolding

  • capitalization

  • codepointCount

  • concatenateGraph

  • dateRecognizer

  • delimitedTermFrequency

  • fingerprint

  • fixBrokenOffsets

  • hyphenatedWords

  • keepWord

  • keywordMarker

  • keywordRepeat

  • length

  • limitTokenCount

  • limitTokenOffset

  • limitTokenPosition

  • removeDuplicates

  • stemmerOverride

  • protectedTerm

  • trim

  • truncate

  • typeAsSynonym

  • wordDelimiter

  • wordDelimiterGraph

  • scandinavianFolding

  • scandinavianNormalization

  • edgeNGram

  • nGram

  • norwegianLightStem

  • norwegianMinimalStem

  • patternReplace

  • patternCaptureGroup

  • delimitedPayload

  • numericPayload

  • tokenOffsetPayload

  • typeAsPayload

  • portugueseLightStem

  • portugueseMinimalStem

  • portugueseStem

  • reverseString

  • russianLightStem

  • shingle

  • fixedShingle

  • snowballPorter

  • serbianNormalization

  • classic

  • swedishLightStem

  • synonym

  • synonymV2 - Similar to the synonymGraph filter, except rules are specified directly in the parameters. See SynonymV2GraphFilterFactory.

  • synonymGraph

  • flattenGraph

  • turkishLowercase

  • elision
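
To illustrate how token filters chain, here is a sketch of a custom analyzer that lowercases tokens, drops very short or very long ones, and then emits two-word shingles. The min and max parameters belong to Lucene's LengthFilterFactory, and minShingleSize, maxShingleSize, and outputUnigrams to ShingleFilterFactory; the values are illustrative, and the filters are assumed to run in the order listed:

```json
{
    "custom": {
        "tokenizer": {"name": "standard"},
        "tokenFilters": [
            {"name": "lowercase"},
            {"name": "length", "params": {"min": "2", "max": "20"}},
            {
                "name": "shingle",
                "params": {
                    "minShingleSize": "2",
                    "maxShingleSize": "2",
                    "outputUnigrams": "true"
                }
            }
        ]
    }
}
```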

Normalizers

A normalizer is similar to an analyzer, but is intended for use with ATOM (keyword) fields. It treats the entire field value as a single token (using a keyword tokenizer internally) and applies only char filters and token filters. Normalizers are applied at both index time and query time (for term queries, term-in-set queries, prefix queries, and range queries), ensuring consistent matching.

Proto definition for Normalizer:

message CustomNormalizer {
    repeated NameAndParams charFilters = 1;
    repeated NameAndParams tokenFilters = 2;
}

message Normalizer {
    oneof NormalizerType {
        string predefined = 1;       // Predefined normalizer name (e.g. "lowercase")
        CustomNormalizer custom = 2;  // Custom normalizer with char filters and token filters
    }
}

Predefined Normalizers

  • lowercase - Lowercases the entire field value.
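
Selecting the predefined normalizer mirrors the predefined analyzer form. A minimal sketch of the Normalizer message in JSON:

```json
{
    "predefined": "lowercase"
}
```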

Custom Normalizers

A custom normalizer is created by combining char filters and token filters (no tokenizer is specified, as the keyword tokenizer is always used). The available char filters and token filters are the same as those listed in the Building Custom Analyzers section above.

Example using lowercase and asciiFolding token filters:

{
    "custom": {
        "tokenFilters": [
            {"name": "lowercase"},
            {"name": "asciiFolding"}
        ]
    }
}