Showing posts with label analysis. Show all posts

Friday, 30 September 2011

Solr : Use of NGram Filter factory for wild card search




Wildcard Search with Solr
Consider a scenario where you need wildcard search. Solr provides an n-gram filter for this case; let's see how to use it for various requirements. Solr ships with many tokenizers and filters, and you can combine them to get the desired result. A wildcard field of this kind can also be used for an autocomplete feature.

  1. To support forward (prefix) wildcard search, create the following field type in Solr's schema.xml:
<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
EdgeNGramFilterFactory output is :
ent ente enter enterp enterpr enterpri enterpris enterprise

The final output for indexing is : ent ente enter enterp enterpr enterpri enterpris enterprise

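In this case a search for “enter*” or “enterpr*” returns the document. Because every prefix of three or more characters is already indexed as its own term, the plain queries “enter” and “enterpr” match as well, without the trailing *.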
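
To see why such a query matches, here is a minimal Python sketch that imitates the index and query analyzers of the field type above. It is not Solr code: the function names (edge_ngrams, index_terms, query_term) are made up for illustration, and it assumes the filters behave exactly as described in this post.

```python
def edge_ngrams(token, min_gram, max_gram, side="front"):
    """Return the edge n-grams of one token, shortest first."""
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[len(token) - n:] for n in range(min_gram, upper + 1)]


def index_terms(text, min_gram=3, max_gram=50, side="front"):
    # KeywordTokenizer keeps the whole input as a single token,
    # LowerCaseFilter lowercases it, then the edge n-grams are taken.
    return edge_ngrams(text.lower(), min_gram, max_gram, side)


def query_term(text):
    # The query analyzer has no n-gram filter, so the query stays one term.
    return text.lower()


terms = index_terms("Enterprise")
print(" ".join(terms))
# ent ente enter enterp enterpr enterpri enterpris enterprise
print(query_term("Enter") in terms)  # True
print(query_term("en") in terms)     # False, shorter than minGramSize
```

The query side only lowercases the input, so a match is a plain term lookup against the prefixes stored at index time.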

  2. To support backward (suffix) wildcard search, create the same field type with only one change: set side to "back".
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>
  3. To support wildcard search from both sides, add the EdgeNGram filter twice in the index analyzer, once for each side:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
Output after the two EdgeNGramFilterFactory filters (every substring of three or more characters) is :
ent nte ente ter nter enter erp terp nterp enterp rpr erpr terpr nterpr enterpr pri rpri erpri terpri nterpri enterpri ris pris rpris erpris terpris nterpris enterpris ise rise prise rprise erprise terprise nterprise enterprise
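
The list above can be reproduced with a small Python sketch that chains the two filters: each front gram is fed into the back filter. This is only an illustration with made-up function names, not Solr's implementation, and the order in which Solr emits the terms may differ.

```python
def edge_ngrams(token, min_gram, max_gram, side):
    """Return the edge n-grams of one token, shortest first."""
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[len(token) - n:] for n in range(min_gram, upper + 1)]


def front_then_back(text, min_gram=3, max_gram=50):
    # KeywordTokenizer + LowerCaseFilter: one lowercased token.
    token = text.lower()
    terms = []
    for front in edge_ngrams(token, min_gram, max_gram, "front"):
        # The second filter takes the back-side grams of every front gram.
        terms.extend(edge_ngrams(front, min_gram, max_gram, "back"))
    return terms


terms = front_then_back("Enterprise")
print(" ".join(terms[:6]))  # ent nte ente ter nter enter
print(len(terms))           # 36

# The result is exactly the set of substrings with 3 or more characters.
word = "enterprise"
substrings = {word[i:j] for i in range(len(word)) for j in range(i + 3, len(word) + 1)}
print(set(terms) == substrings)  # True
```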

The benefit of the third approach is that a search can match any run of characters inside the indexed word, not just its beginning or end. In my case the title of the document had to be searchable this way, so I indexed the title with two n-gram filters, one from the front and the other from the back, as shown above. I suggest not using this for long text such as document content, because the number of indexed terms grows quickly and hampers indexing performance. This field type is useful when you want to match any part of a word, and for cases like autocomplete.
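
To get a rough feel for that cost, the Python sketch below counts how many terms one field value produces with the settings used in this post (minGramSize 3, maxGramSize 50). The helper term_counts is hypothetical, and the number of emitted terms is only a proxy for real indexing cost in Solr.

```python
def edge_ngrams(token, min_gram, max_gram, side):
    """Return the edge n-grams of one token, shortest first."""
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[len(token) - n:] for n in range(min_gram, upper + 1)]


def term_counts(length, min_gram=3, max_gram=50):
    """Count emitted terms for a value of the given length:
    (front filter only, front filter followed by back filter)."""
    token = "x" * length  # only the length matters for the count
    front = edge_ngrams(token, min_gram, max_gram, "front")
    both = [g for f in front for g in edge_ngrams(f, min_gram, max_gram, "back")]
    return len(front), len(both)


for length in (10, 20, 50):
    front_only, front_and_back = term_counts(length)
    print(length, front_only, front_and_back)
# 10 8 36
# 20 18 171
# 50 48 1176
```

With a single front filter the count grows linearly with the length of the value, while the front-and-back combination grows roughly with its square.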

You can create a customised field type using the tokenizers and filters provided by Solr.
Create your own field type and analyse it using the analysis page of the Solr admin interface.
The link to the analysis page is http://localhost:8080/solr/admin/analysis.jsp. The analysis page can be used to verify a search match, and it helps you investigate what went wrong with the indexing or query output.

The analysis page looks like this for the above wildCardType when the EdgeNGram filter is applied twice, from the front and from the back:

[Screenshot of the Solr analysis page]

If you want to match every prefix of the word, starting from its first character, index it from the front side with minGramSize set to 1, using the fieldType below:


<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
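
To illustrate how this field type supports autocomplete, here is a small in-memory Python sketch. It only imitates the analyzers above; the sample titles and the helper names (front_edge_ngrams, build_index, suggest) are invented for the example and are not part of Solr.

```python
from collections import defaultdict


def front_edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the front-side edge n-grams of one token, shortest first."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(titles):
    # Index analyzer: keep the whole title as one lowercased token,
    # then store every prefix as a term pointing back to the title.
    index = defaultdict(set)
    for title in titles:
        for gram in front_edge_ngrams(title.lower()):
            index[gram].add(title)
    return index


def suggest(index, typed):
    # Query analyzer: lowercase only, then a plain term lookup.
    return sorted(index.get(typed.lower(), set()))


titles = ["Enterprise", "Entry Level", "Solr Search"]  # made-up sample data
index = build_index(titles)
print(suggest(index, "e"))       # ['Enterprise', 'Entry Level']
print(suggest(index, "ENT"))     # ['Enterprise', 'Entry Level']
print(suggest(index, "ente"))    # ['Enterprise']
print(suggest(index, "search"))  # []
```

The last lookup returns nothing because KeywordTokenizerFactory keeps "Solr Search" as one token, so only prefixes of the whole value are indexed.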