
Friday, 30 January 2015

Solr Revisited...

A long time back I wrote a few blog posts on Solr setup and how I used it in my application.
Since then I have been getting questions and queries from readers about them.
And these readers make you work on it again... revisit the technologies and stuff you had worked on...

The queries go like: I am searching for "Solr Blog" as text but I also get results for "Solr" and
"Solr indexing", or: my blah blah search text is not returning the correct results.
After going through all these queries I realised where it is going wrong.
Or rather, how my blogs are not helping them resolve their issues...
So I decided to write another one, which may ease their job or guide them while working on Solr, and
will also brush up my own knowledge of the subject :)

Let's go with the real example which made me pen down this blog...
A reader asked me this question:
"If I search for "Abhijit & his blogs", it should show only the exact match "Abhijit & his blogs" and not matches like "Abhijit"
or "Abhijit Songs"...".

What questions come to your mind on this...? What could be wrong here...?
Any idea...?


There is a simple theory... what you index is what will be available for search... :)
In short, you are using the wrong analyzer for indexing and querying...

Find out which fieldType is being used for indexing and querying...
Is it a default one provided by Solr, or have you written your own custom field type (created using the
available tokenizers and filters)?

For the above example, turn off stemming...
One solution for the problem would be...

<types>

   <fieldType name="text_no_stem" class="solr.TextField" omitNorms="false">
      <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StandardFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
   </fieldType>

</types>


<fields>
   <dynamicField name="*_nostem" type="text_no_stem" indexed="true" stored="true"/>
</fields>

Or use the Solr string field, which does an exact-value search, e.g.

<fieldType class="solr.StrField" name="string" omitNorms="true" sortMissingLast="true" />
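
To see why the field type matters here, below is a minimal Python sketch (it is not Solr code) that contrasts a tokenized, lowercased field with an untokenized string field. The `tokenize` helper is a hypothetical and much simplified stand-in for StandardTokenizerFactory plus LowerCaseFilterFactory, and the "any shared token is a hit" rule assumes OR behaviour between query terms.

```python
import re

def tokenize(text):
    """Simplified stand-in for a standard tokenizer + lowercase filter:
    keep runs of letters/digits and lowercase each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def matches_text_field(query, doc):
    """Tokenized field with OR semantics: one shared token is enough."""
    return bool(set(tokenize(query)) & set(tokenize(doc)))

def matches_string_field(query, doc):
    """Untokenized string field: the whole value must be identical."""
    return query == doc

docs = ["Abhijit & his blogs", "Abhijit", "Abhijit Songs"]
query = "Abhijit & his blogs"

text_hits = [d for d in docs if matches_text_field(query, d)]
string_hits = [d for d in docs if matches_string_field(query, d)]

print(text_hits)    # every document shares the token "abhijit"
print(string_hits)  # only the identical value matches
```

On a tokenized field, a phrase query (the search text in double quotes) is the usual way to narrow such matches, while the string field only matches the complete, case-sensitive value.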


Now coming to the point: what did I realise..?

That we need to know more about the role of tokenizers and filters in the analyzers.
Which types of tokenizers and filters are available... how to make use of them and when to use them..
And how to test them...?

Here is a brief note on the three important things

1. Analyzer : It pre-processes the text at the time of indexing and querying (or search). And yes, make sure the analyzers you use for index and query are compatible with each other (usually the same).
     For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
             

2. Tokenizers : A tokenizer splits/separates a stream of characters into a series of tokens, i.e. small chunks/words. There can be only one tokenizer in each analyzer.
     There are different tokenizers available to use. For example..
     KeywordTokenizerFactory
     LetterTokenizerFactory
     WhitespaceTokenizerFactory
     StandardTokenizerFactory
     LowerCaseTokenizerFactory


3. Filters : A filter works at the granular level of tokens. It takes the tokens produced by the tokenizer and can change, drop or add tokens before they are indexed.
     There are different filters available to use..
     LowerCaseFilterFactory
     ClassicFilterFactory
     StopFilterFactory
     EdgeNGramFilterFactory

and many more are available.
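
The three pieces above can be sketched in a few lines of Python. This is only an illustration of the idea (one tokenizer, then a chain of filters), not how Solr implements it. The function names and the tiny stopword list are hypothetical stand-ins for WhitespaceTokenizerFactory, LowerCaseFilterFactory and StopFilterFactory.

```python
def whitespace_tokenizer(text):
    """Stand-in for a whitespace tokenizer: split on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Stand-in for a lowercase filter."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"a", "an", "the"})):
    """Stand-in for a stop filter, with a tiny hypothetical stopword list."""
    return [t for t in tokens if t not in stopwords]

def make_analyzer(tokenizer, *filters):
    """An analyzer is exactly one tokenizer followed by zero or more filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

index_analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
good_query_analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
bad_query_analyzer = make_analyzer(whitespace_tokenizer)  # forgets to lowercase

indexed = set(index_analyzer("The Solr Blog"))
print(sorted(indexed))
print(set(good_query_analyzer("SOLR")) <= indexed)  # True: "solr" was indexed
print(set(bad_query_analyzer("SOLR")) <= indexed)   # False: "SOLR" was never indexed
```

The last two lines show the point made under Analyzer: if the index side lowercases and the query side does not, the query term simply is not in the index.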


You should be very precise about how you want your search to work.
Then it's very easy to decide on the fieldType... or rather, very easy to create your own type and use it.

While doing any sort of analysis, do use the tool given by Solr, i.e. analysis.jsp (in newer Solr versions this is the Analysis screen of the admin UI).
It will help you a lot in resolving your problems...
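
If you prefer to script such checks, the same kind of analysis can usually be requested over HTTP as well. The sketch below only builds the request URL and does not call Solr. It assumes that your Solr version exposes the field analysis request handler at `/analysis/field` (older versions may need it configured in solrconfig.xml), and the host, port and core name `collection1` are made up for illustration.

```python
from urllib.parse import urlencode

def analysis_url(base_url, field_type, index_text, query_text):
    """Build a URL for Solr's field analysis handler (assumed at /analysis/field)."""
    params = {
        "analysis.fieldtype": field_type,    # field type to analyse with
        "analysis.fieldvalue": index_text,   # text run through the index analyzer
        "analysis.query": query_text,        # text run through the query analyzer
        "wt": "json",
    }
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)

# Hypothetical base URL and core name; adjust to your own setup.
url = analysis_url(
    "http://localhost:8983/solr/collection1",
    "text_no_stem",
    "Abhijit & his blogs",
    "Abhijit",
)
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the tokens produced at each step of the index and query analyzers.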

If you want to read more about Solr's tokenizers and filters, refer to the Solr Wiki, which has a lot of info
https://wiki.apache.org/solr/

Please feel free to share your views on it...

Friday, 30 September 2011

Solr : Use of NGram Filter factory for wild card search




WildCard Search with Solr.
Consider a scenario where one wants wildcard search. For this case Solr provides an N-gram filter. Let's see how to use it for various requirements. Solr provides many tokenizers and filters; combine them to get the desired result. The wildcard field can also be used for an autocomplete feature.

  1. To have forward wildcard search, create the field type in Solr's schema.xml as :
<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
EdgeNGramFilterFactory output is :
ent ente enter enterp enterpr enterpri enterpris enterprise

The final output for indexing is : ent ente enter enterp enterpr enterpri enterpris enterprise

In this case a search for “enter” or “enterpr” will return the result, because those prefixes are themselves indexed terms; “enter*” or “enterpr*” will match as well.
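
The steps above are easy to replay in plain Python. This is a rough sketch of what the index and query analyzers of wildCardType do, not Solr's own code. The helper names `edge_ngrams`, `index_terms` and `query_term` are made up for the illustration.

```python
def edge_ngrams(token, min_gram, max_gram, side="front"):
    """Prefixes (side="front") or suffixes (side="back") of one token,
    with lengths from min_gram up to max_gram, capped at the token length."""
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[-n:] for n in range(min_gram, upper + 1)]

def index_terms(value):
    # Keyword tokenizer: the whole value is one token; then lowercase; then front grams.
    return edge_ngrams(value.lower(), 3, 50, side="front")

def query_term(value):
    # Query analyzer: one token, lowercased, no n-grams.
    return value.lower()

terms = index_terms("Enterprise")
print(" ".join(terms))
# ent ente enter enterp enterpr enterpri enterpris enterprise

print(query_term("Enter") in terms)  # True: "enter" is an indexed prefix
print(query_term("prise") in terms)  # False: suffixes are not indexed with side="front"
```

Note that the n-grams are produced only at index time. The query side keeps the typed text as one lowercased term and looks it up directly.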

  2. To have backward wildcard search, create the field type in Solr as :
    The only change for backward wildcard search is to change the side to back.
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>
  3. To use wildcard search from both sides, create the field type in Solr as :
    In this case add the edge n-gram filter twice.
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
Output after both EdgeNGramFilterFactory filters is :
ent nte ente ter nter enter erp terp nterp enterp rpr erpr terpr nterpr enterpr pri rpri erpri terpri nterpri enterpri ris pris rpris erpris terpris nterpris enterpris ise rise prise rprise erprise terprise nterprise enterprise

The benefit of the 3rd way shows when the search should match any run of characters inside the specified word. In my case it was the title of a document that had to be searched this way. So I indexed the title using two n-gram filters, one from the front and the other from the back, as shown above. But I suggest not using this for long text like document content, as it will hamper indexing performance. This field type is useful when you want to search for any part of the word, and for cases like autocomplete.
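
To get a feel for both the output and the cost of the front-plus-back combination, here is a small Python sketch that chains the two steps the same way (front grams first, then back grams of each front gram). It is a simplified model, not Solr code, and `edge_ngrams` and `both_sides` are hypothetical helper names.

```python
def edge_ngrams(token, min_gram, max_gram, side):
    """Prefixes (side="front") or suffixes (side="back") of one token."""
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[-n:] for n in range(min_gram, upper + 1)]

def both_sides(value, min_gram=3, max_gram=50):
    """Front edge n-grams, then back edge n-grams of every front gram."""
    terms = []
    for prefix in edge_ngrams(value.lower(), min_gram, max_gram, "front"):
        terms.extend(edge_ngrams(prefix, min_gram, max_gram, "back"))
    return terms

terms = both_sides("Enterprise")
print(" ".join(terms[:6]))  # ent nte ente ter nter enter
print(len(terms))           # 36 tokens for a 10-character value

# The number of emitted tokens grows roughly with the square of the length.
for length in (10, 20, 40):
    print(length, len(both_sides("x" * length)))
```

A 10-character title turns into 36 tokens, a 20-character one into 171 and a 40-character one into 741, which is why this field type is better kept for short values such as titles.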

You can create a customised field type using the tokenizers and filters provided by Solr.
Create your own field type and analyse it using the analysis admin page of Solr.
The link to the analysis page is http://localhost:8080/solr/admin/analysis.jsp. The analysis.jsp page can be used to verify a search match. It helps you investigate what went wrong with the indexing and query output.

[Screenshot: the analysis page for the above wildCardType when the n-gram filter is applied twice, from the front and from the back.]





If you want to match prefix substrings, index the word with n-grams from the front side, starting at a single character.

Use the below fieldType :


<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>
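
With minGramSize="1" every prefix of the value is indexed, starting from the first character, which is what an autocomplete box needs. Below is a minimal Python sketch of that idea, using a plain dictionary as a toy index. The titles and the helper names `prefix_terms` and `suggest` are invented for the example, and this is not how Solr stores its index.

```python
def prefix_terms(value, min_gram=1, max_gram=50):
    """Keyword token, lowercased, then front edge n-grams from length min_gram."""
    token = value.lower()
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Hypothetical titles, each indexed as one keyword token.
titles = ["Enterprise Search", "Entity Extraction", "Solr Basics"]

index = {}
for title in titles:
    for term in prefix_terms(title):
        index.setdefault(term, []).append(title)

def suggest(typed):
    """Query side: lowercase the typed text and look it up as a single term."""
    return index.get(typed.lower(), [])

print(suggest("e"))       # ['Enterprise Search', 'Entity Extraction']
print(suggest("Enter"))   # ['Enterprise Search']
print(suggest("search"))  # []: the whole title is one token, so only its prefixes match
```

Because the keyword tokenizer keeps the whole value as one token, only text typed from the start of the title matches. If matching from the start of every word is needed, a tokenizer that splits on whitespace would be the thing to try.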