Analyzers

Nrtsearch provides some default analyzers, lets you specify any analyzer available in Lucene, and supports building custom analyzers.

Proto definition for Analyzer:

message Analyzer {
    oneof AnalyzerType {
        string predefined = 1; // Analyzers predefined in Lucene. Apart from standard and classic, there are en.English, bn.Bengali, eu.Basque, etc. (names derived from Lucene's analyzer class names)
        CustomAnalyzer custom = 2;
    }
}

To use a predefined analyzer, provide its name in the predefined field. To create a custom analyzer, provide its definition in the custom field.
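As a rough illustration, the two forms of the Analyzer message might look like this when written as JSON. This is a sketch that assumes the standard proto3 JSON mapping of the fields shown in the proto definition; where the message is embedded in a larger request is not shown here.

```python
import json

# Predefined analyzer: only the analyzer name is needed.
predefined_analyzer = {"predefined": "en.English"}

# Custom analyzer: the definition goes in the "custom" field instead.
# The tokenizer and token filter names come from the lists in this document.
custom_analyzer = {
    "custom": {
        "tokenizer": {"name": "standard"},
        "tokenFilters": [{"name": "lowercase"}],
    }
}

print(json.dumps(predefined_analyzer))
print(json.dumps(custom_analyzer, indent=2))
```

Because predefined and custom are members of the same oneof, a single Analyzer message sets one or the other, not both.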

Predefined Analyzers

The following predefined analyzers are available in Nrtsearch. Their names are derived from the analyzer classes available in Lucene.

  • ar.Arabic

  • hy.Armenian

  • eu.Basque

  • bn.Bengali

  • br.Brazilian

  • bg.Bulgarian

  • ca.Catalan

  • cjk.CJK

  • standard.Classic

  • cz.Czech

  • da.Danish

  • nl.Dutch

  • en.English

  • et.Estonian

  • fi.Finnish

  • fr.French

  • gl.Galician

  • de.German

  • el.Greek

  • hi.Hindi

  • hu.Hungarian

  • id.Indonesian

  • ga.Irish

  • it.Italian

  • lv.Latvian

  • lt.Lithuanian

  • no.Norwegian

  • fa.Persian

  • pt.Portuguese

  • ro.Romanian

  • ru.Russian

  • ckb.Sorani

  • es.Spanish

  • core.Stop

  • sv.Swedish

  • th.Thai

  • tr.Turkish

  • standard.UAX29URLEmail

  • core.UnicodeWhitespace

  • core.Whitespace

  • query.QueryAutoStopWord

  • miscellaneous.LimitTokenCount

Building Custom Analyzers

This is the proto definition for a CustomAnalyzer:

message NameAndParams {
    string name = 1;
    map<string, string> params = 2;
}

message ConditionalTokenFilter {
    NameAndParams condition = 1;
    repeated NameAndParams tokenFilters = 2;
}

// Used to be able to check if a value was set
message IntObject {
    int32 int = 1;
}

message CustomAnalyzer {
    repeated NameAndParams charFilters = 1; // Available char filters as of Lucene 8.2.0: htmlstrip, mapping, persian, patternreplace
    NameAndParams tokenizer = 2; // Specify a Lucene tokenizer (https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/analysis/Tokenizer.html). Possible options as of Lucene 8.2.0: keyword, letter, whitespace, edgengram, pathhierarchy, pattern, simplepatternsplit, classic, standard, uax29urlemail, thai, wikipedia.
    repeated NameAndParams tokenFilters = 3; // Specify a Lucene token filter (https://lucene.apache.org/core/8_2_0/core/org/apache/lucene/analysis/TokenFilter.html). The possible options can be seen at https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html or by calling TokenFilterFactory.availableTokenFilters().
    repeated ConditionalTokenFilter conditionalTokenFilters = 4; // TODO: this is not properly supported yet, the only impl requires a protected terms file. Can support this properly later if needed
    string defaultMatchVersion = 5; // Lucene version as LUCENE_X_Y_Z or X.Y.Z, LATEST by default
    IntObject positionIncrementGap = 6; // Must be >= 0
    IntObject offsetGap = 7; // Must be >= 0
}
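The two gap fields are wrapped in IntObject so that an unset value can be told apart from 0, and both must be >= 0. A client could check this before sending a request. The following is a minimal sketch: validate_gaps is a hypothetical helper (not part of Nrtsearch), and it works on a plain dict that mirrors the JSON form of CustomAnalyzer.

```python
def validate_gaps(custom):
    """Return a list of problems with the gap fields of a CustomAnalyzer-shaped dict.

    Hypothetical client-side helper; it only checks the ">= 0" constraint
    stated in the proto comments.
    """
    problems = []
    for field in ("positionIncrementGap", "offsetGap"):
        wrapper = custom.get(field)
        if wrapper is None:
            continue  # field not set at all, nothing to check
        # An empty IntObject carries the proto3 default value of 0.
        value = wrapper.get("int", 0)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{field}.int must be an integer >= 0, got {value!r}")
    return problems


example = {
    "tokenizer": {"name": "standard"},
    "positionIncrementGap": {"int": 100},
    "offsetGap": {"int": -1},
}
print(validate_gaps(example))
```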

A custom analyzer is created by combining character filters, a tokenizer, and token filters. Each of these performs a different function in analysis:

  • Character filter: changes or removes characters in the input text. Executed first during analysis.

  • Tokenizer: consumes a stream of characters and outputs tokens. Executed after all character filters.

  • Token filter: changes or removes tokens in the token stream. Executed after the tokenizer.
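The order of the three stages can be sketched with a toy pipeline in plain Python. None of this is Lucene code: the functions below are hypothetical stand-ins that only loosely mimic what htmlstrip, whitespace, lowercase, and stop do, and the stop-word set is made up for the example.

```python
import re


def strip_html(text):
    # Character filter stage: rewrite the raw text before tokenization.
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    # Tokenizer stage: turn the filtered character stream into tokens.
    return text.split()


def lowercase(tokens):
    # Token filter stage: change tokens.
    return [t.lower() for t in tokens]


def remove_stop_words(tokens, stop_words=frozenset({"the", "a", "an"})):
    # Token filter stage: remove tokens.
    return [t for t in tokens if t not in stop_words]


def analyze(text):
    text = strip_html(text)              # 1. character filters
    tokens = whitespace_tokenize(text)   # 2. tokenizer
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)    # 3. token filters, in order
    return tokens


print(analyze("<b>The</b> Quick brown Fox"))  # ['quick', 'brown', 'fox']
```

Token filter order matters in this sketch: running remove_stop_words before lowercase would leave "The" in the output, because the stop-word set is lowercase.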

The API also lets you provide map<string, string> params for each character filter, tokenizer, or token filter. These params override the component's default parameters.
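For example, a custom analyzer that passes params to one of its token filters might be written as follows. This is a sketch in the JSON form of the message: the component names come from the lists in this document, while minGramSize and maxGramSize are assumed to be the parameter names accepted by Lucene's edge n-gram filter factory and should be checked against the Lucene version in use.

```python
import json

custom_analyzer = {
    "custom": {
        "charFilters": [
            {"name": "htmlstrip"},  # no params: defaults are used
        ],
        "tokenizer": {"name": "standard"},
        "tokenFilters": [
            {"name": "lowercase"},
            {
                "name": "edgeNGram",
                # params is map<string, string>, so numbers are passed as strings.
                # Parameter names here are assumptions, see the note above.
                "params": {"minGramSize": "2", "maxGramSize": "5"},
            },
        ],
        # Gap values are wrapped in IntObject.
        "positionIncrementGap": {"int": 100},
    }
}

print(json.dumps(custom_analyzer, indent=2))
```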

Available character filters:

  • htmlstrip

  • mapping

  • persian

  • patternreplace

  • mappingV2 - Similar to the mapping filter, except rules are specified directly in the parameters. See MappingV2CharFilterFactory.

Available tokenizers:

  • keyword

  • letter

  • whitespace

  • edgengram

  • ngram

  • pathhierarchy

  • pattern

  • simplepatternsplit

  • simplepattern

  • classic

  • standard

  • uax29urlemail

  • thai

  • wikipedia

Available token filters:

  • suggestStop

  • apostrophe

  • arabicNormalization

  • arabicStem

  • bulgarianStem

  • bengaliNormalization

  • bengaliStem

  • brazilianStem

  • cjkBigram

  • cjkWidth

  • soraniNormalization

  • soraniStem

  • commonGrams

  • commonGramsQuery

  • dictionaryCompoundWord

  • hyphenationCompoundWord

  • decimalDigit

  • lowercase

  • stop

  • type

  • uppercase

  • czechStem

  • germanLightStem

  • germanMinimalStem

  • germanNormalization

  • germanStem

  • greekLowercase

  • greekStem

  • englishMinimalStem

  • englishPossessive

  • kStem

  • porterStem

  • spanishLightStem

  • spanishMinimalStem

  • persianNormalization

  • finnishLightStem

  • frenchLightStem

  • frenchMinimalStem

  • irishLowercase

  • galicianMinimalStem

  • galicianStem

  • hindiNormalization

  • hindiStem

  • hungarianLightStem

  • hunspellStem

  • indonesianStem

  • indicNormalization

  • italianLightStem

  • latvianStem

  • minHash

  • asciiFolding

  • capitalization

  • codepointCount

  • concatenateGraph

  • dateRecognizer

  • delimitedTermFrequency

  • fingerprint

  • fixBrokenOffsets

  • hyphenatedWords

  • keepWord

  • keywordMarker

  • keywordRepeat

  • length

  • limitTokenCount

  • limitTokenOffset

  • limitTokenPosition

  • removeDuplicates

  • stemmerOverride

  • protectedTerm

  • trim

  • truncate

  • typeAsSynonym

  • wordDelimiter

  • wordDelimiterGraph

  • scandinavianFolding

  • scandinavianNormalization

  • edgeNGram

  • nGram

  • norwegianLightStem

  • norwegianMinimalStem

  • patternReplace

  • patternCaptureGroup

  • delimitedPayload

  • numericPayload

  • tokenOffsetPayload

  • typeAsPayload

  • portugueseLightStem

  • portugueseMinimalStem

  • portugueseStem

  • reverseString

  • russianLightStem

  • shingle

  • fixedShingle

  • snowballPorter

  • serbianNormalization

  • classic

  • swedishLightStem

  • synonym

  • synonymV2 - Similar to the synonymGraph filter, except rules are specified directly in the parameters. See SynonymV2GraphFilterFactory.

  • synonymGraph

  • flattenGraph

  • turkishLowercase

  • elision