Visualising Sitecore Analyzers
When Sitecore indexes your content, Lucene analyzers work to break down your text into a series of individual tokens. For instance, a simple analyzer might convert input text to lowercase, split into separate words, and remove punctuation:
- input: Hi there! My name is Chris.
- output tokens: “hi”, “there”, “my”, “name”, “is”, “chris”
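To make that concrete, here is a minimal sketch of such an analyzer in plain Python. It is not Lucene's or Sitecore's implementation; `simple_analyze` is a hypothetical name, and the rule "a token is a run of letters or digits" is an assumption chosen to reproduce the example above.

```python
import re


def simple_analyze(text):
    """Lowercase the text, then keep each run of letters/digits as a token.

    Punctuation and whitespace never match the pattern, so they are dropped.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


print(simple_analyze("Hi there! My name is Chris."))
# ['hi', 'there', 'my', 'name', 'is', 'chris']
```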
This happens behind the scenes and is rarely of interest outside of diagnostics or curiosity, but there’s a way to view the output of the analyzers bundled with Sitecore.
Let’s get some input text to analyze, in both English and French:
Next, let’s write a generic method which takes some text and a Lucene analyzer, and runs the text through the analyzer:
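The shape of that helper can be sketched without any Lucene dependency. The Python version below is only an illustration of the idea, under the assumption that an "analyzer" is any callable that takes a string and returns a list of tokens; `show_tokens` is a hypothetical name and does not come from Sitecore or Lucene.

```python
from typing import Callable, List


def show_tokens(text: str, analyzer: Callable[[str], List[str]]) -> List[str]:
    """Run text through the given analyzer and print each token it emits."""
    tokens = analyzer(text)
    for position, token in enumerate(tokens):
        # Brackets make leading/trailing whitespace inside a token visible.
        print(f"{position}: [{token}]")
    return tokens


# str.split stands in for a very crude whitespace analyzer here.
show_tokens("Hello, Sitecore!", str.split)
```

Keeping the helper generic means each analyzer can be swapped in without changing the code that prints the results.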
Now, let’s try this out on some Sitecore analyzers!
CaseSensitiveStandardAnalyzer
retains case, but removes punctuation and stop words (common words which offer no real value when searching)
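As a rough sketch of that behaviour (plain Python, not the Sitecore class itself): the stop-word set below is a small illustrative subset, and matching stop words regardless of case is an assumption on my part, so check the real analyzer's output before relying on it.

```python
import re

# Small illustrative subset of English stop words (assumption, not the full list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def case_sensitive_analyze(text):
    """Drop punctuation and stop words, but leave the case of kept words alone."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Assumption: stop words are matched case-insensitively.
    return [word for word in words if word.lower() not in STOP_WORDS]


print(case_sensitive_analyze("The Quick brown fox is over the Dog."))
# ['Quick', 'brown', 'fox', 'over', 'Dog']
```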
LowerCaseKeywordAnalyzer
converts the input to lowercase, but retains the punctuation and doesn’t split the input into separate words.
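In other words, the whole input comes out as one token. A minimal Python sketch of that idea (the function name is hypothetical, and this is not the Sitecore implementation):

```python
def lowercase_keyword_analyze(text):
    """Emit the entire input as a single lowercased token, punctuation included."""
    return [text.lower()]


print(lowercase_keyword_analyze("Hi there! My name is Chris."))
# ['hi there! my name is chris.']
```

This kind of output suits fields that should only match as a whole, such as identifiers or exact paths.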
NGramAnalyzer
breaks text up into trigrams, which are useful for autocomplete. See more here.
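A quick sketch of trigram generation in plain Python follows. It assumes character-level trigrams over the lowercased input, which may differ from how Sitecore's analyzer is actually configured; `char_ngrams` is a hypothetical name.

```python
def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in the lowercased text."""
    text = text.lower()
    # A sliding window of width n; inputs shorter than n yield no grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("sitecore"))
# ['sit', 'ite', 'tec', 'eco', 'cor', 'ore']
```

Because a partially typed word shares grams with the full word, an index built this way can match input before the user has finished typing.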
StandardAnalyzerWithStemming
introduces stemming, which finds a common root for similar words (lazy, lazily, laze -> lazi)
SynonymAnalyzer
uses a set of synonyms (in our case, defined in an XML file) to index synonyms (fast, rapid) along with the original word (quick). Read more: http://firebreaksice.com/sitecore-synonym-search-with-lucene/
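The effect can be sketched with a simple lookup table. In the Python below, the dictionary stands in for the XML synonym file, and both the names and the whitespace tokenization are assumptions made for illustration rather than anything taken from Sitecore.

```python
# Stand-in for the XML synonym file: each word maps to its synonyms.
SYNONYMS = {"quick": ["fast", "rapid"]}


def synonym_analyze(text):
    """Emit each word, followed by any synonyms configured for it."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        tokens.extend(SYNONYMS.get(word, []))
    return tokens


print(synonym_analyze("the quick fox"))
# ['the', 'quick', 'fast', 'rapid', 'fox']
```

With the synonyms indexed alongside the original word, a search for "fast" or "rapid" can find content that only ever said "quick".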
Lastly, we try a FrenchAnalyzer. Stop words are language-specific, so the community often contributes analyzers that remove stop words in languages other than English. In the example below, we remove common French words.
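To illustrate the idea only, here is a Python sketch with a handful of French stop words. The set is a tiny illustrative sample rather than the list shipped with any real FrenchAnalyzer, and the sample sentence is my own.

```python
import re

# Tiny illustrative sample of French stop words (assumption, not a real analyzer's list).
FRENCH_STOP_WORDS = {"le", "la", "les", "et", "de", "des", "un", "une", "dans"}


def french_analyze(text):
    """Lowercase, split into words, and drop the French stop words above."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in FRENCH_STOP_WORDS]


print(french_analyze("Le chat et le chien sont dans la maison."))
# ['chat', 'chien', 'sont', 'maison']
```

Running the same sentence through an English stop-word list would leave "le", "et" and "la" in the index, which is why the list has to match the content language.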
The full code is here: https://gist.github.com/christofur/e2ea406c21bccd3b032c9b861df0749b