Explaining Lucene explain
Each time you perform a search using Lucene, a score is applied to the results returned by your query.
In our index, # is the unique document number, score is the closeness of each hit to our query, and tags is a text field belonging to a document.
There are several methods Lucene can use to calculate scoring. By default, it uses the DefaultSimilarity implementation of the Similarity abstract class, which implements the commonly referenced TF-IDF scoring formula:
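In rough terms (paraphrasing the Similarity documentation linked below), the score of a document d for a query q is:

```
score(q, d) = coord(q, d) · queryNorm(q)
              · Σ over each term t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )
```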
(more: https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/search/Similarity.html)
If you’re new to Lucene (or even if you’re not!), this formula can be a bit much to get your head around. To get inside the formula for a given search result, Lucene provides an explanation feature, which we can call from code (C# example using Lucene.Net):
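A minimal sketch of this, assuming Lucene.Net 3.0.x, an existing index directory, and the ‘tags’ field used in the examples below:

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class ExplainDemo
{
    public static void ExplainMatches(Directory directory)
    {
        // Build the same kind of two-term query ('movies kids') against the 'tags' field.
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var query = new QueryParser(Version.LUCENE_30, "tags", analyzer).Parse("movies kids");

        using (var searcher = new IndexSearcher(directory, readOnly: true))
        {
            foreach (var match in searcher.Search(query, 10).ScoreDocs)
            {
                // Explain returns an Explanation tree; ToString() renders it as indented text.
                Explanation explanation = searcher.Explain(query, match.Doc);
                Console.WriteLine(explanation);
            }
        }
    }
}
```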
Calling searcher.Explain(query, match.doc) gives us a text output explanation of how the matched document scores against the query:
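The exact layout varies between Lucene versions, but for our document the breakdown has roughly this shape (all figures are the ones walked through below; the inner lines of the ‘kids’ clause are omitted):

```
2.4824 = sum of:
  1.4570 = weight(tags:movies in 127), product of:
    0.7079 = queryWeight(tags:movies), product of:
      2.9105 = idf(docFreq=147, maxDocs=1000)
      0.2432 = queryNorm
    2.0581 = fieldWeight(tags:movies in 127), product of:
      1.4142 = tf(termFreq(tags:movies)=2.0)
      2.9105 = idf(docFreq=147, maxDocs=1000)
      0.5000 = fieldNorm(field=tags, doc=127)
  1.0255 = weight(tags:kids in 127), product of:
    ...
```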
Ok! But still, there’s a lot going on in there. Let’s try and break it down.
- 2.4824 is the total score for this single search result. As our query contained two terms, ‘movies’ and ‘kids’, Lucene breaks the overall query down into two subqueries.
- The scores of the two subqueries (1.4570 for ‘movies’ and 1.0255 for ‘kids’) are added together to arrive at our total score.
For our first subquery, the ‘movies’ part, we arrive at the score of 1.4570 by multiplying queryWeight (0.7079) by fieldWeight (2.0581). Let’s go line by line:
↑ The total score for the ‘movies’ subquery is 1.4570. ‘tags:movies’ is the raw query, 127 is the individual document number we’re examining, and DefaultSimilarity is the scoring mechanism we’re using.
↑ The term (‘movies‘) appears twice in the ‘tags‘ field for document 127, so we get a term frequency of 2.0
↑ queryWeight (0.7079) reflects how rare the search term is within the whole index – in our case, ‘movies’ appears in 147 out of the 1000 documents in our index.
↑ This rarity is called inverse document frequency (idf)…
↑ …and is itself multiplied by a normalization factor (0.2432) called queryNorm. This normalization factor is the same for all results returned by our query and just stops the queryWeight scores from becoming too exaggerated for any single result.
↑ fieldWeight (2.0581) is how often the search term (‘movies’) appears in the field we searched on, ‘tags’.
↑ We take the square root of the termFreq: √2.0 = 1.4142
↑ This is multiplied by the idf which we calculated above (2.9105)
↑ and finally by a field normalization factor (0.5000), which tells us how many overall terms were in the field. This ‘boost’ value will be higher for shorter fields – meaning the more prominent your search term was in a field, the more relevant the result.
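Putting those figures back together (this is just the arithmetic from the lines above, not a Lucene API call):

```csharp
// Recap of the 'movies' clause arithmetic, using the figures from the explanation above.
double tf = Math.Sqrt(2.0);           // termFreq 2.0 -> 1.4142
double idf = 2.9105;                  // 'movies' appears in 147 of the 1000 documents
double queryNorm = 0.2432;            // the same for every result of this query
double fieldNorm = 0.5000;            // shorter 'tags' fields get a higher value

double queryWeight = idf * queryNorm;           // ≈ 0.7079
double fieldWeight = tf * idf * fieldNorm;      // ≈ 2.0581
double moviesScore = queryWeight * fieldWeight; // ≈ 1.4570
```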
Further reading:
- http://www.lucenetutorial.com/advanced-topics/scoring.html
- https://ayende.com/blog/166274/the-lucene-formula-tf-idf

Happy Lucene hacking!
When Sitecore indexes your content, Lucene analyzers work to break down your text into a series of individual tokens. For instance, a simple analyzer might convert input text to lowercase, split into separate words, and remove punctuation:
- input: Hi there! My name is Chris.
- output tokens: “hi”, “there”, “my”, “name”, “is”, “chris”
While this happens behind the scenes, and is usually not of too much interest outside of diagnostics or curiosity, there’s a way we can view the output of the analyzers bundled with Sitecore.
Let’s get some input text to analyze, in both English and French:
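Something like the following will do – the English line is the sentence from the example above, while the French line is just an illustrative stand-in:

```csharp
// Sample inputs – the French sentence is a made-up example for the FrenchAnalyzer later on.
var englishText = "Hi there! My name is Chris.";
var frenchText = "Bonjour ! Je m'appelle Chris et je travaille sur le site.";
```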
Next, let’s write a generic method which takes some text and a Lucene analyzer, and runs the text through the analyzer:
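A minimal sketch of such a helper, assuming Lucene.Net 3.0.x (the version bundled with Sitecore’s Lucene provider):

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public static class AnalyzerTester
{
    // Runs the given text through the given analyzer and returns the tokens it emits.
    public static List<string> Tokenize(string text, Analyzer analyzer)
    {
        var tokens = new List<string>();

        // The field name is only a label here; most analyzers treat every field the same way.
        using (TokenStream stream = analyzer.TokenStream("tags", new StringReader(text)))
        {
            var termAttribute = stream.AddAttribute<ITermAttribute>();

            while (stream.IncrementToken())
            {
                tokens.Add(termAttribute.Term);
            }

            stream.End();
        }

        return tokens;
    }
}
```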
Now, let’s try this out on some Sitecore analyzers!
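The Sitecore analyzers below (from the Sitecore.ContentSearch.LuceneProvider.Analyzers namespace) are each passed to the helper in the same way – their constructor arguments vary, so this just shows the pattern using Lucene’s plain StandardAnalyzer:

```csharp
// Pattern only – swap in one of the Sitecore analyzers discussed below in place of StandardAnalyzer.
var analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var tokens = AnalyzerTester.Tokenize(englishText, analyzer);
Console.WriteLine(string.Join(", ", tokens)); // roughly: hi, my, name, chris (lowercased, stop words and punctuation gone)
```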
CaseSensitiveStandardAnalyzer
retains case, but removes punctuation and stop words (common words which offer no real value when searching)
LowerCaseKeywordAnalyzer
converts the input to lowercase, but retains the punctuation and doesn’t split the input into separate words.
NGramAnalyzer
breaks text up into trigrams which are useful for autocomplete. See more here.
StandardAnalyzerWithStemming
introduces stemming, which finds a common root for similar words (lazy, lazily, laze -> lazi)
SynonymAnalyzer
uses a set of synonyms (in our case, defined in an XML file) to index synonyms (fast, rapid) along with the original word (quick). Read more: http://firebreaksice.com/sitecore-synonym-search-with-lucene/
Lastly, we try a FrenchAnalyzer. Stop words are language-specific, and so the community often contributes analyzers which will remove stop words in languages other than English. In the example below, we remove common French words.
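As a sketch, using the stock contrib FrenchAnalyzer that ships with Lucene.Net (Lucene.Net.Analysis.Fr) and the sample French text from earlier:

```csharp
// FrenchAnalyzer drops common French stop words, then lowercases and stems the remaining tokens.
var frenchAnalyzer = new Lucene.Net.Analysis.Fr.FrenchAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var frenchTokens = AnalyzerTester.Tokenize(frenchText, frenchAnalyzer);
// Words like 'je', 'et' and 'le' should disappear from the output.
```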
The full code is here: https://gist.github.com/christofur/e2ea406c21bccd3b032c9b861df0749b