Tuesday 8 November 2011

Multiple cores in Solr

Multiple cores allow you to create separate index files for every single module of your application. Each core has its own configuration files, e.g. every core can have its own data-config.xml, solrconfig.xml and schema.xml, and its own index files. You can also administer those cores using http://localhost:8080/solr/ .


Cores are created on the fly by using http://localhost:8080/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path_to_instance_directory&config=config_file_name.xml&schema=schema_file_name.xml&dataDir=data. Here CREATE is the action name for creating a core, name is the unique name for the core, instanceDir is the path to the Solr home, config is the path of the solrconfig.xml, schema is the path of the schema.xml and finally dataDir is the path where you want to store the index files.


In my case it was : http://localhost:8080/solr/admin/cores?action=CREATE&name=9476067&instanceDir=/home/abashetti/Downloads/abhijit/solr/&config=/home/abashetti/Downloads/abhijit/solr/conf/solrconfig.xml&schema=/home/abashetti/Downloads/abhijit/solr/conf/schema.xml&dataDir=/home/abashetti/Downloads/abhijit/solr/9476067/document/data 


This will create an entry in the solr.xml file.


The defaultCoreName must be mentioned in this XML file; if the default core name is missing, you will get an error in the browser.

The default solr.xml file will look like :

<?xml version="1.0" encoding="UTF-8" ?>

<solr persistent="false">
 <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="." />
  </cores>
</solr>

To have cores that are created on the fly written back to solr.xml, modify solr.xml and set the attribute persistent="true":



<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="." />
    <core name="9476067" instanceDir="./"
          config="/home/abashetti/Downloads/abhijit/solr/conf/solrconfig.xml"
          schema="/home/abashetti/Downloads/abhijit/solr/conf/schema.xml"
          dataDir="/home/abashetti/Downloads/abhijit/solr/9476067/data" />
    <core name="12385546" instanceDir="./"
          config="/home/abashetti/Downloads/abhijit/solr/conf/solrconfig.xml"
          schema="/home/abashetti/Downloads/abhijit/solr/conf/schema.xml"
          dataDir="/home/abashetti/Downloads/abhijit/solr/12385546/data" />
  </cores>
</solr>


A core can also be reloaded or removed on the fly. To remove/unload a core on the fly, use the URL:

http://localhost:8080/solr/admin/cores?action=UNLOAD&core=9476067&deleteIndex=true


The above URL removes the core's entry from solr.xml and deletes the index files from the dataDir.



Here UNLOAD is the action for removing the core, the core parameter takes the core name, and setting deleteIndex to true also deletes the index files. If deleteIndex is false, Solr removes only the core entry from solr.xml.
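
Reloading a core works through the same cores admin handler; for example, to reload the core created above without restarting Solr:

http://localhost:8080/solr/admin/cores?action=RELOAD&core=9476067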

The URL to check all the cores is http://localhost:8080/solr/admin/
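
Each core is searched through its own URL; for example, to query the core created above with the standard select handler:

http://localhost:8080/solr/9476067/select?q=*:*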




Saturday 15 October 2011

Oracle case insensitive search

Suppose you are searching on a column that contains document names, and the names are stored in a mix of lower and upper case. To find a document by name regardless of case, the query would be:

Select d.document_name from document d where LOWER(d.document_name) like LOWER('%java%')

Select d.document_name from document d where UPPER(d.document_name) like UPPER('%java%')
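
An alternative that avoids wrapping both sides in LOWER()/UPPER() is Oracle's REGEXP_LIKE with the 'i' (case-insensitive) match parameter; a sketch against the same table:

Select d.document_name from document d where REGEXP_LIKE(d.document_name, 'java', 'i')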

Sunday 9 October 2011

Converting VARCHAR2 to CLOB and CLOB to VARCHAR2 for ORACLE 10g


1. Converting Varchar2 to Clob

ALTER TABLE TEST ADD (TEMP_DESCRIPTION_TEXT  CLOB);

Add a column named "TEMP_DESCRIPTION_TEXT" to the table; its data type will be CLOB.

UPDATE TEST SET TEMP_DESCRIPTION_TEXT=DESCRIPTION_TEXT;
COMMIT;

Copy the text from existing column "DESCRIPTION_TEXT" to the new column "TEMP_DESCRIPTION_TEXT".

ALTER TABLE TEST DROP COLUMN DESCRIPTION_TEXT;

Drop the old column named "DESCRIPTION_TEXT".

ALTER TABLE TEST RENAME COLUMN TEMP_DESCRIPTION_TEXT TO DESCRIPTION_TEXT;

Rename the new column "TEMP_DESCRIPTION_TEXT" to the old name "DESCRIPTION_TEXT".
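
To confirm the conversion you can check the data dictionary, e.g.:

SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME = 'TEST';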

2. Converting Clob to Varchar2


ALTER TABLE TEST ADD (TEMP_DESCRIPTION_TEXT  VARCHAR2(4000 BYTE));

Add a column named "TEMP_DESCRIPTION_TEXT" to the table; its data type will be VARCHAR2.


UPDATE TEST SET TEMP_DESCRIPTION_TEXT=DBMS_LOB.SUBSTR(DESCRIPTION_TEXT,4000,1);
COMMIT;

Copy the text from existing column "DESCRIPTION_TEXT" to the new column "TEMP_DESCRIPTION_TEXT". 
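
Note that DBMS_LOB.SUBSTR keeps only the first 4000 characters, so longer values are silently truncated. A quick check before dropping the old column shows how many rows are affected:

SELECT COUNT(*) FROM TEST WHERE DBMS_LOB.GETLENGTH(DESCRIPTION_TEXT) > 4000;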

ALTER TABLE TEST DROP COLUMN DESCRIPTION_TEXT;

Drop the old column named "DESCRIPTION_TEXT".

ALTER TABLE TEST RENAME COLUMN TEMP_DESCRIPTION_TEXT TO DESCRIPTION_TEXT;

Rename the new column "TEMP_DESCRIPTION_TEXT" to the old name "DESCRIPTION_TEXT".

Friday 30 September 2011

Solr : Use of NGram filter factory for wildcard search




Wildcard search with Solr.
Consider a scenario where you want wildcard-style search. For this case Solr provides an N-gram filter. Let's see how to use it for various requirements. Solr ships with many tokenizers and filters; combine these tokenizers and filters to get the desired result. The wildcard field can also be used for an autocomplete feature.

  1. To have the forward wildcard search, create the field type in the Solr schema.xml as :
<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
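
A field still has to be declared with this type before it can be used; a minimal sketch (the field name title_wildcard and the copyField source title are just examples):

<field name="title_wildcard" type="wildCardType" indexed="true" stored="false"/>
<copyField source="title" dest="title_wildcard"/>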

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
EdgeNGramFilterFactory output is :
ent ente enter enterp enterpr enterpri enterpris enterprise

The final output for indexing is : ent ente enter enterp enterpr enterpri enterpris enterprise

In this case, if you search for “enter*” or “enterpr*” you will get a match.

  2. To have the backward wildcard search, create the field type in Solr as :
    The only change for the backward wildcard search is setting the side to back.
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>
  3. To have the wildcard search from both sides, create the field type in Solr as :
    In this case add the N-gram filter twice.
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="back"/>

The text to be indexed is “Enterprise”
KeywordTokenizerFactory output is :
Enterprise
LowerCaseFilterFactory output is :
enterprise
EdgeNGramFilterFactory output (applied from the front and the back) is :
ent nte ente ter nter enter erp terp nterp enterp rpr erpr terpr nterpr enterpr pri rpri erpri terpri nterpri enterpri ris pris rpris erpris terpris nterpris enterpris ise rise prise rprise erprise terprise nterprise enterprise

The benefit of the third way is that the search can match any substring of the specified word. In my case it was document titles that had to be searched this way, so I indexed the title using two N-gram filters, one from the front and one from the back, as shown above. But I suggest not using this for long text such as document content, as it will hamper indexing performance. This field type is useful when you want to match any part of a word, as in autocomplete.

You can create a customised field type using the tokenizers and filters provided by Solr.
Create your own field type and analyse it using Solr's analysis admin page.
The link to the analysis page is http://localhost:8080/solr/admin/analysis.jsp. The analysis.jsp page can be used to verify search matches, and it helps you investigate what went wrong between the indexing and the query output.

The analysis page looks like this for the above wildCardType when you use the N-gram filter twice, from the front and the back:





If you want to match prefix substrings, index the word from the front side only.

Use the fieldType below :


<fieldType name="wildCardType" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
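
Following the same walkthrough as above, with minGramSize="1" the text “Enterprise” would be indexed as:
e en ent ente enter enterp enterpr enterpri enterpris enterprise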









Thursday 29 September 2011

SOLR Oracle data-importer with variableResolver in data-config.xml


Most applications store data in relational databases like MySQL, Oracle, DB2, etc., and searching over such data is a common use case. The DataImportHandler is a Solr contrib module that provides a configuration-driven way to import this data into Solr. For this we will create a data-config.xml whose query contains a variable.



The data-config.xml for MySQL will look like this :

<dataConfig>
<dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname" user="root" password="root"/>
<dataSource name="ds-file" type="BinFileDataSource"/>
<document name="documents">
<entity name="book" dataSource="ds-db" query="select distinct
book.id as id,
book.title,
book.author,
book.publisher
from Books book
where book.book_added_date >= str_to_date(${dataimporter.request.lastIndexDate}, '%d/%m/%Y %H:%i:%s')"
transformer="DateFormatTransformer">
<field column="id" name="id"/>
<field column="title" name="title"/>
<field column="author" name="author"/>
<field column="publisher" name="publisher"/>
<entity name="content" query="select description from content where content_id='${book.id}'">
<field column="description" name="description"/>
</entity>
</entity>
</document>
</dataConfig>
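
This assumes the DataImportHandler has been registered in solrconfig.xml, for example:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>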

In the URL you need to pass the variable with its value.
The URL to start the data import in this case will be :
http://localhost:8080/solr/dataimport?command=full-import&clean=false&commit=true&lastIndexDate='08/05/2011 20:16:11'

For the first-time indexing you need to pass “lastIndexDate=null”.
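
Since the date value contains spaces and slashes, it is safer to URL-encode it when calling the handler from a script; for example with curl:

curl "http://localhost:8080/solr/dataimport?command=full-import&clean=false&commit=true&lastIndexDate=%2708%2F05%2F2011%2020%3A16%3A11%27"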


The data-config for Oracle will look like this :

<dataConfig>
<dataSource name="ds-db" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@127.0.0.1:1521:test" user="dev" password="dev"/>
<dataSource name="ds-file" type="BinFileDataSource"/>
<document name="documents">
<entity name="book" dataSource="ds-db" query="select distinct
book.id as id,
book.title,
book.author,
book.publisher
from Books book
where book.book_added_date >= to_date(${dataimporter.request.lastIndexDate}, 'DD/MM/YYYY HH24:MI:SS')"
transformer="DateFormatTransformer">
<field column="id" name="id"/>
<field column="title" name="title"/>
<field column="author" name="author"/>
<field column="publisher" name="publisher"/>
<entity name="content" query="select description from content where content_id='${book.Id}'">
<field column="description" name="description"/>
</entity>
</entity>
</document>
</dataConfig>

The change here in the data-config.xml for Oracle is ${book.Id} and not ${book.id}. It took me a long time to find this out by debugging.






Wednesday 28 September 2011

Indexing a database with Solr 3.4 from an Oracle database and integrating Solr with Tika.



  1. Download the Tomcat-5.5.33 from here.
  2. Install Tomcat (no special instructions here; just run the installer and select the directory where you wish to install).
  3. Start Tomcat by running startup.sh in the bin directory.
  4. Verify the installation of Tomcat by going to http://localhost:8080
  5. Download SOLR from one of the mirrors found here (download the apache-solr-3.4.0-src.tgz package) and unzip the package. e.g. Solr is extracted at /home/abashetti/Downloads/apache-solr-3.4.0/
  6. Open the Terminal. Go to the extracted apache solr folder. e.g. cd /home/abashetti/Downloads/apache-solr-3.4.0/solr
  7. Create the Solr war. Run the ant commands: ant clean, ant compile and ant dist (a consolidated sketch of the build steps follows this list).
  8. Ant dist will create the *solr*.war in */solr/dist/ folder. e.g. path for the war file is(/home/abashetti/Downloads/apache-solr-3.4.0/solr/dist).
  9. To enable the data importer functionality, add the apache-solr-dataimporthandler and apache-solr-dataimporthandler-extras jars to the Solr lib.
  10. The apache-solr-dataimporthandler , apache-solr-dataimporthandler-extras jars are available at */apache-solr-3.4.0/solr/contrib/dataimporthandler/target/
    e.g. path is from where I copied the jar files is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/contrib/dataimporthandler/target/)
    & solr lib path is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/lib).
  11. To extract text from the various document formats, Apache Tika is used. Download Apache Tika from here.
  12. Build the source code of Apache Tika using maven. For maven set up read here.
  13. Copy the jar files tika-app, tika-bundle, tika-core and tika-parsers from the target folder to the Solr lib. In my case the Solr lib path is (/home/abashetti/Downloads/apache-solr-3.4.0/solr/lib).
  14. Create the solr war again after adding the jars. Run the ant commands – ant clean , ant compile and ant dist.
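
Put together, the build steps above look roughly like this (a sketch using the paths from this post):

cd /home/abashetti/Downloads/apache-solr-3.4.0/solr
# copy the dataimporthandler jars into the Solr lib
cp contrib/dataimporthandler/target/apache-solr-dataimporthandler*.jar lib/
# copy the Tika jars built with maven into the Solr lib as well, then rebuild the war
ant clean
ant compile
ant dist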

  1. Create a directory SOLR. It is the SOLR HOME, where SOLR will be hosted from
    (e.g. /home/abashetti/Downloads/solr).
  2. Copy the files and folders from the path /home/abashetti/Downloads/apache-solr-3.4.0/solr/example/solr to your SOLR HOME. e.g. the destination path is
    (/home/abashetti/Downloads/solr/).
  3. Visit http://localhost:8080/solr/admin to make sure everything is still running.
  4. Go to the path /home/abashetti/Downloads/apache-solr-3.4.0/solr/example/solr/conf.
  5. Create a file data-config.xml. Add the database connection information and the query
    in this file.
  6. Configure the datasource in the data-config.xml.

<dataConfig>
<dataSource name="ds-db" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@127.0.0.1:1521:test" user="root" password="root"/>
<dataSource name="ds-file" type="BinFileDataSource"/>
<document name="documents">
<entity name="document" dataSource="ds-db" query="select distinct
doc.document_id as id,
doc.title,
doc.author,
doc.publisher,
(case when doc.content_format_code not in ('doc','pdf','xml','txt','ppt','xls') then
( select path.document_path from document_path path where path.doc_id = doc.id )
else
''
end) contentpath
from ds_document_c doc
where doc.index_state_modification_date >= to_date(${dataimporter.request.lastIndexDate}, 'DD/MM/YYYY HH24:MI:SS')" transformer="DateFormatTransformer">

<field column="id" name="id"/>
<field column="title" name="title"/>
<field column="author" name="author"/>
<field column="publisher" name="publisher"/>
<entity name="textEntity" processor="TikaEntityProcessor" url="${document.CONTENTPATH}" dataSource="ds-file" format="text" onError="continue">
<field column="text" name="text"/>
</entity>
</entity>
</document>
</dataConfig>

Substitute the database username and password with your database credentials.
  1. Add the location of the data-config in solrconfig.xml under the DataImportHandler section.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>


  1. Edit the schema.xml file. The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

<field name="id" type="integer" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="publisher" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" />

Find the “<uniqueKey>” node and change it to: <uniqueKey>id</uniqueKey>

Find the “<defaultSearchField>” node and change it to: <defaultSearchField>text</defaultSearchField>

Delete all the “<copyField>” nodes.
  1. Copy the *solr*.war file from the dist directory in the unzipped SOLR package to your Tomcat webapps folder.
  2. Rename the *solr*.war file to solr.war
  3. Specify the Solr home in catalina.sh
    JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/home/abashetti/Downloads/solr"
  4. Add the above line just below the JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
  5. Copy the jar ojdbc6.jar to the path : */apache-tomcat-5.5.33/common/lib
  6. Now go to http://localhost:8080/solr/admin/dataimport.jsp. Click on the /DATAIMPORT link and you will see the dataimporter console. Click on the button “Full Import With Cleaning”. It will start indexing. By clicking on the status button you can see the progress of the indexing. If indexing is in progress it will show the status as “busy”, otherwise “Indexing completed for “number” of documents”.
  7. Once the indexing is completed, go to http://localhost:8080/solr/admin and click on the search button to check the result.
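
You can also query the index directly to verify the result, for example:

http://localhost:8080/solr/select?q=*:*&rows=10&fl=id,title,author,publisher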