Friday, 30 September 2011

Solr : Use of NGram Filter factory for wild card search




WildCard Search with Solr.
Consider a scenario where you need wildcard search. Solr provides n-gram filters for this case, so let's see how to use them for various requirements. Solr ships with many tokenizers and filters, and you combine them to get the desired result. The resulting wildcard field can also be used for an autocomplete feature.

  1. For forward (prefix) wildcard search, create the following field type in Solr's schema.xml:
<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
EdgeNGramFilterFactory output is :
ent ente enter enterp enterpr enterpri enterpris enterprise

The final output for indexing is : ent ente enter enterp enterpr enterpri enterpris enterprise

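In this case a search for “enter*” or “enterpr*” returns the document. Because the prefixes themselves are indexed as terms, the plain queries “enter” and “enterpr” match as well, without the asterisk.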
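To make the analysis chain above concrete, here is a minimal Python sketch that imitates it. It is not Solr code: the helper names edge_ngrams, index_terms and query_term are made up for illustration, and the sketch only mimics what KeywordTokenizerFactory, LowerCaseFilterFactory and EdgeNGramFilterFactory do to a single field value.

```python
def edge_ngrams(token, min_size, max_size, side="front"):
    """Return the edge n-grams of `token`, shortest first.

    side="front" takes prefixes, side="back" takes suffixes.
    """
    longest = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, longest + 1)]
    return [token[-n:] for n in range(min_size, longest + 1)]


def index_terms(text, min_size=3, max_size=50):
    # Keyword tokenizer: the whole value is one token.
    # Lowercase filter, then front-edge n-grams (hypothetical stand-ins
    # for the index analyzer of wildCardType).
    return edge_ngrams(text.lower(), min_size, max_size, side="front")


def query_term(text):
    # The query analyzer has no n-gram step: keyword token + lowercase only.
    return text.lower()


terms = index_terms("Enterprise")
print(terms)
print(query_term("Enter") in terms)  # True: "enter" is an indexed prefix
print(query_term("en") in terms)     # False: shorter than minGramSize=3
```

The last line shows the effect of minGramSize="3": a two-character query has no matching term in the index.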

  2. For backward (suffix) wildcard search, use the same field type with one change in the index analyzer: set side to back.
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>
  3. For wildcard search from both sides, add the edge n-gram filter to the index analyzer twice, once per side:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
Output after both EdgeNGramFilterFactory filters is :
ent nte ente ter nter enter erp terp nterp enterp rpr erpr terpr nterpr enterpr pri rpri erpri terpri nterpri enterpri ris pris rpris erpris terpris nterpris enterpris ise rise prise rprise erprise terprise nterprise enterprise

The benefit of the third approach is that a search can match any run of characters inside the indexed word, as long as it is at least minGramSize characters long. In my case, document titles had to be searchable this way, so I indexed the title using two edge n-gram filters, one from the front and one from the back, as shown above. I suggest not using this for long text such as document content, because it will hamper indexing performance. This field type is useful when you want to match any substring of a word, and for cases like autocomplete.
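The following Python sketch imitates the front-then-back filter chain to show why it yields every substring of three or more characters, and why it gets expensive on long values. It is an illustration only, not Solr code; edge_ngrams and front_then_back are hypothetical helper names.

```python
def edge_ngrams(token, min_size, max_size, side):
    """Prefixes (side="front") or suffixes (side="back") of `token`, shortest first."""
    longest = min(max_size, len(token))
    if side == "front":
        return [token[:n] for n in range(min_size, longest + 1)]
    return [token[-n:] for n in range(min_size, longest + 1)]


def front_then_back(text, min_size=3, max_size=50):
    # First filter: front-edge grams of the lowercased keyword token.
    fronts = edge_ngrams(text.lower(), min_size, max_size, "front")
    # Second filter: back-edge grams of every token the first filter emitted.
    tokens = []
    for gram in fronts:
        tokens.extend(edge_ngrams(gram, min_size, max_size, "back"))
    return tokens


tokens = front_then_back("Enterprise")
print(" ".join(tokens))
print(len(tokens))      # 36 tokens for a 10-character title
print("rpr" in tokens)  # True: a substring from the middle of the word

# Tokens emitted per field value grow roughly with the square of its length.
for length in (10, 50):
    print(length, len(front_then_back("x" * length)))
```

In this sketch a 10-character value produces 36 tokens and a 50-character value produces 1176, which is the reason to keep this field type for short values such as titles.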

You can create a customised field type using the tokenizers and filters provided by Solr.
Create your own field type and analyse it using the analysis admin page of Solr.
The analysis page is at http://localhost:8080/solr/admin/analysis.jsp. You can use it to verify whether a query matches, and it helps you investigate what went wrong between the index-time and query-time output.

The analysis page looks like this for the above wildCardType when you use the Ngram filter twice from the front and back side:





If you want to match prefix substrings starting from the very first character, index the word from the front side with minGramSize="1".

Use the fieldType below:


<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
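Because this field type uses KeywordTokenizerFactory, only prefixes of the whole field value are indexed, not prefixes of each word. The Python sketch below imitates that behaviour and contrasts it with per-word tokenization. The helper names and the two sample titles are hypothetical, and the whitespace variant is included only for comparison; it is not part of the field type above.

```python
def front_edge_ngrams(token, min_size=1, max_size=50):
    # Prefixes of `token` from min_size up to max_size characters.
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]


def keyword_index_terms(value):
    # Keyword tokenizer: the entire field value is a single token.
    return set(front_edge_ngrams(value.lower()))


def whitespace_index_terms(value):
    # For contrast only: a whitespace tokenizer would gram every word separately.
    terms = set()
    for word in value.lower().split():
        terms.update(front_edge_ngrams(word))
    return terms


for title in ("Amazon Kindle", "Kindle Amazon"):
    print(title,
          "am" in keyword_index_terms(title),
          "am" in whitespace_index_terms(title))
```

With the keyword tokenizer the query "am" matches only the title that starts with "am"; with per-word tokenization it matches both titles.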









16 comments:

  1. Hi Abhijit, can you tell me whether we can use both of the following two filters for the same field? Basically I want the same search result for "men text" and "text men". Can you please explain how to do that?

  2. No sir... applying above filter will create the indexes as..

    me en men n en men t n t en t men t te te n te en te men te ex tex tex n tex en tex men tex xt ext text text n text en text men text

    This filter is used for wildcard search, as when you are searching for 'tex*';

    it will then fetch the result because that term is in the index.

    But I think you need not use n-grams in your case, because
    if you use only WhitespaceTokenizerFactory it should work...

    as in the field type would be ...

  3. Applying this Ngram filter returns same results for "men text" and "text men" search text.

    I have one more problem here: if I search for "men text", it returns records based on an OR operation. How do I get an AND operation for the text "men text"?
    For example, there are 10 results for "men" and 10 results for "text", but "men" and "text" both occur in only 5 records, so when I search for "men text" AND should be applied and I should get only 5 records. I can achieve this by searching for "men AND text" directly on the Solr console. Is there a way to have Solr add "AND" wherever it sees a space?

    I tried


    but it doesn't help.

  4. This comment has been removed by the author.

  5. This comment has been removed by the author.

  6. Hi,

    You can modify the parameter in schema.xml:

    change the solrQueryParser's defaultOperator from "OR" to "AND".

    But I suggest you don't do that. Instead, put double quotes around the words, like "men text", and send that query to Solr. It will interpret it accordingly and reply with the expected result.

  7. Hi,
    What if I want to return only results that start with a, or am, or amaz? For example, if I write q=a (assuming I am using dismax), it should return all the results that start with a, and if q=am, all the results that start with am. Instead it gives me every result that has 'am' even in the middle of a word; if I have 'kindle amazon' it returns that as well, whereas I only want the ones that start with am, so 'amazon kindle' should be returned. How can I modify the code you mentioned to do this search?

    Thanks.

    Replies
    1. Use only this filter:

      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>

      since you want to match prefix substrings, indexing the word from the front side.

      The text to be indexed is “Enterprise”
      EdgeNGramFilterFactory output is :
      ent ente enter enterp enterpr enterpri enterpris enterprise

      If you want it to start from the first character, change minGramSize to "1".

      The filter would then be:

      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>

      The text to be indexed is “Enterprise”
      EdgeNGramFilterFactory output is :
      e en ent ente enter enterp enterpr enterpri enterpris enterprise

      Using only this filter should solve your problem.











  8. Thanks Abhijiit for sharing this blog post. EdgeNGramFilterFactory doesn't consider special characters. If I search for "Abhijiit & his blogs", it should not show me "Abhijiit" or "Abhijiit songs", but it does. How can I fix this?

    Replies
    1. Hi Amol..
      I am glad to know that its useful to you...
      Have you written/customised the fieldType?
      Is it possible for you to share the fieldType with me?

      It would be easy for me to help you with same..

    2. Hi,

      What exactly are the indexed text and the search text in your case?

      I think in your case you should not use "EdgeNGramFilterFactory", as it seems you don't want the value broken down into n-gram terms.

      You should have KeywordTokenizerFactory as your tokenizer.

  9. Dear Abhijit,
    My requirement is that I want to block wildsearch in my site.
    Can you show me the way to do so.
    Girish

    Replies
    1. The StrField type is not analyzed, but indexed/stored verbatim.

    2. You can change the fieldType of the field you are searching on.
      It all depends on how you want to provide the search.
      You can use the string field type for that field, so that no analysis is applied to it.

      The StrField type is not analyzed, but indexed/stored verbatim.

      Change the Tokenizer from StandardTokenizerFactory to KeywordTokenizerFactory.

  10. This comment has been removed by the author.

  11. Hi,

    Could you please tell me how to extend maxGramSize to 20 in Solr 5.2.1?
