Technical Post: Solr : Use of NGram Filter factory for wild card search

Friday, 30 September 2011

WildCard Search with Solr.

Consider a scenario where you need wildcard search. For this case Solr provides n-gram filters. Let's see how to use them for various requirements. Solr provides many tokenizers and filters, and you can combine them to get the desired result. The wildcard field can also be used for an autocomplete feature.

To support forward wildcard search, create the field type in Solr's schema.xml as :

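<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>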

The text to be indexed is “Enterprise”

KeywordTokenizerFactory output is :

Enterprise

LowerCaseFilterFactory output is :

enterprise

EdgeNGramFilterFactory output is :

ent ente enter enterp enterpr enterpri enterpris enterprise

The final output for indexing is : ent ente enter enterp enterpr enterpri enterpris enterprise

In this case, if you search for “enter” or “enterpr” (the equivalent of “enter*” or “enterpr*”), you will get the result, because those prefixes are stored as terms in the index.

To support backward wildcard search, create the field type in Solr the same way. The only change from the forward field type is to set the side to back :

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>

To use wildcard search from both sides, create the field type in Solr as :

In this case add the edge n-gram filter twice, once from the front and once from the back.
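
The field type for this case is not reproduced on the page. A minimal sketch, assuming it differs from the forward field type only in the index analyzer and that both filters keep minGramSize="3" and maxGramSize="50" (an assumption, though one consistent with the output listed for “Enterprise”), might look like this:

```xml
<!-- Sketch only: the name and attribute values are assumed, not confirmed by the original post -->
<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- First pass: every prefix of the whole value, 3 characters or longer -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
    <!-- Second pass: every suffix (3 characters or longer) of each prefix -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query is only lowercased, so the typed text is compared against the stored grams -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Chaining the two filters in this order yields every substring of three or more characters. Newer Solr releases may no longer accept side="back", so check the filter documentation for your version before relying on it.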

The text to be indexed is “Enterprise”

KeywordTokenizerFactory output is :

Enterprise

LowerCaseFilterFactory output is :

enterprise

Output after both EdgeNGramFilterFactory filters is :

ent nte ente ter nter enter erp terp nterp enterp rpr erpr terpr nterpr enterpr pri rpri erpri terpri nterpri enterpri ris pris rpris erpris terpris nterpris enterpris ise rise prise rprise erprise terprise nterprise enterprise

The benefit of the third approach is that a search can match any substring of the indexed word (at least three characters long, as the output above shows), not only its beginning or end. In my case it was the document title that had to be searched this way, so I indexed the title using two n-gram filters, one from the front and the other from the back, as shown above. I suggest not using this for long text such as document content, as it will hamper indexing performance. This field type is useful when you want to search for any characters within a word, and for cases like autocomplete.

You can create a customised field type using the tokenizers and filters provided by Solr.

Create your own field type and analyse it using the analysis admin page of Solr.

The link to the analysis page is http://localhost:8080/solr/admin/analysis.jsp. The analysis page can be used to verify the search match. It helps you investigate what went wrong with the indexing and query output.

The analysis page looks like this for the above wildCardType when you use the edge n-gram filter twice, from the front and from the back:

[Screenshot of the Solr analysis page]

If you want to match prefix substrings, index the word from the front side using the fieldType below :

<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
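
To put the field type to work, a field has to be declared with it. The fragment below is a sketch under assumptions: the field names title and title_wildcard are hypothetical, and the string type is assumed to exist as it does in the stock example schema. Only wildCardType comes from the post.

```xml
<!-- Hypothetical field names; adjust to your own schema.xml -->
<field name="title" type="string" indexed="true" stored="true"/>
<!-- The n-gram copy is only searched, never displayed, so it need not be stored -->
<field name="title_wildcard" type="wildCardType" indexed="true" stored="false"/>
<!-- Feed the raw title into the n-gram field at index time -->
<copyField source="title" dest="title_wildcard"/>
```

With such a setup, a query against title_wildcard (for example title_wildcard:enter) would be matched against the stored grams, while the original title stays available for display.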

16 comments:

Anonymous, 21 June 2012 at 09:10
Hi Abhijit, can you tell me whether we can use both of the following two filters for the same field? Basically I want the same search result for "men text" and "text men". Can you please explain how we do that?
Abhijit Bashetti, 21 June 2012 at 21:55
No sir... applying the above filter will create the index terms as:

me en men n en men t n t en t men t te te n te en te men te ex tex tex n tex en tex men tex xt ext text text n text en text men text

This filter is used for wildcard search, as when you are searching for 'tex*'; it will then fetch the result because we have that term in the index.

But I think you need not use n-grams in your case, because if you use only WhitespaceTokenizerFactory it has to work...

as in the field type would be ...
Anonymous, 22 June 2012 at 12:02
Applying this n-gram filter returns the same results for the "men text" and "text men" search text.

I have one more problem here: if I search "men text", it returns records based on an OR operation. How do I get an AND operation for the text "men text"?
For example, there are 10 results for "men" and 10 results for "text", but both "men" and "text" appear in only 5 records, so when I search "men text" I should apply AND and get only 5 records. I can achieve this by searching "men AND text" directly on the Solr console. Is there a way to have Solr add "AND" whenever it sees a space?

I tried

but it doesn't help.
Abhijit Bashetti, 23 July 2012 at 00:52
Hi,

You can modify the parameter in schema.xml: change the solrQueryParser's defaultOperator, which is currently "OR", to "AND".

But I suggest you don't go for it; rather, add double quotes around the words, like "men text", and send the query to Solr. It will interpret it accordingly and reply with the expected result.
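
For reference, in Solr releases of that period the schema.xml setting mentioned in this reply typically looks like the line below. Treat it as a sketch and check the schema shipped with your version, since later releases may expect the default operator to be supplied through the q.op request parameter instead.

```xml
<!-- schema.xml: require every whitespace-separated term unless the query says otherwise -->
<solrQueryParser defaultOperator="AND"/>
```

Note that this changes the behaviour of every query against the core, which is one reason to prefer a per-query approach.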
Unknown, 23 April 2013 at 08:46
Hi,
I want to return only results that start with a, or am, or amaz. For example, if I write q=a (assuming I'm using dismax), it should return all the results that start with a, and if q=am, it should return all the results that start with am. Instead it gives me every result, even if 'am' is in the middle of the word, and if I have 'kindle amazon' it returns that as well, whereas I only want the ones that start with am; so if it was 'amazon kindle', it should return that document. How can I modify the code you mentioned to do my required search?

Thanks.
Unknown, 28 January 2015 at 02:17
Thanks Abhijit for sharing this blog post. EdgeNGramFilterFactory doesn't consider special characters. If I search for "Abhijiit & his blogs", it should not show me "Abhijiit" or "Abhijiit songs", whereas it is showing them. How can I fix this?
Anonymous, 23 July 2015 at 03:02
Dear Abhijit,
My requirement is that I want to block wildcard search on my site.
Can you show me the way to do so?
Girish
Krishnaveni, 28 October 2020 at 03:39
Hi,

Could you please tell me how to extend maxGramSize to 20 in Solr 5.2.1?