7/29/2016

Customize Lucene index for searching texts with accents (or any other special character)

Posted by Edwin Guilbert

Applications that require full-text indexing and search frequently use Lucene as their content retrieval library.

In content management systems it's usually used for searching content across websites. Magnolia CMS uses JCR for storing and managing all its data. JCR is implemented by Apache Jackrabbit, which by default uses Lucene for full-text indexing and searching.

The problem is that by default Lucene uses an English-oriented analyser, which means that when indexing it does not take into account special characters from other languages.

What does this mean?

It means that if your content contains the word "competición", the index will treat "Competición", "COMPETICIÓN" and "competición" as the same word, but if you search for "competicion" (note the missing accent) it will return no results.
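
To make the mismatch concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class and method names are hypothetical, and java.text.Normalizer is only a stand-in for what an accent-aware analyser does to terms. It shows that lowercasing alone keeps the accent, while lowercasing plus accent folding makes both spellings produce the same term. Accented letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: plain-Java stand-in for analyser behaviour, not Lucene code.
class AccentFoldingDemo {

    // Unifies case but leaves accents untouched.
    public static String lowercaseOnly(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Unifies case, decomposes accented letters (NFD) and drops the combining marks.
    public static String lowercaseAndFold(String term) {
        String decomposed = Normalizer.normalize(
                term.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String indexed = "Competici\u00f3n"; // "Competición" as stored in the content
        String query = "competicion";        // what the user types, without accent

        // Lowercasing alone: the terms still differ, so the search finds nothing.
        System.out.println(lowercaseOnly(indexed).equals(lowercaseOnly(query)));       // false
        // Lowercasing plus folding: both sides become "competicion".
        System.out.println(lowercaseAndFold(indexed).equals(lowercaseAndFold(query))); // true
    }
}
```

Note that this generic folding strips every diacritic, including the tilde of "ñ", which may be more aggressive than what a language-specific analyser does.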

How does Lucene solve this problem?

Lucene has components called analysers, which contain the filters applied when indexing and searching. These filters handle lower/upper case, stopwords (common words that are ignored) and stemming (reducing words to their semantic root).
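
The idea of a filter chain can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not Lucene's API: the class name, the tiny stopword list and the plural-stripping rule are all hypothetical and far cruder than what a real analyser ships with. It only shows the order of the steps: split into tokens, lowercase, drop stopwords, stem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an analyser's filter chain; not Lucene code.
class ToyAnalyzer {

    // Hypothetical, tiny stopword list; real analysers use much longer ones.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("el", "la", "los", "las", "de", "y", "en"));

    // Naive stand-in for stemming: strip a plural "es" or "s" from longer words.
    public static String stem(String token) {
        if (token.length() > 4 && token.endsWith("es")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Tokenise on non-letter/non-digit characters, then apply each filter in turn.
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            String token = raw.toLowerCase(Locale.ROOT); // case filter
            if (STOPWORDS.contains(token)) {
                continue;                                // stopword filter
            }
            terms.add(stem(token));                      // stemming filter
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("La final de las competiciones")); // [final, competicion]
    }
}
```

Because the same chain runs on the indexed text and on the query, "COMPETICIONES" and "competicion" end up as the same term in this toy model.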

You can create your own analyser with more filters if you want to fine-tune your search results. However, Lucene comes with a set of common analysers for specific languages.


In the case mentioned above we want Spanish words to be analysed correctly, so we use the Spanish analyser, which in the case of Magnolia will be the one included in Lucene 3.6.0 (the version used in Jackrabbit 2.8.0).

org.apache.lucene.analysis.es.SpanishAnalyzer

We have to include the following library in Magnolia (it is not included by default):

<dependency>
 <groupId>org.apache.lucene</groupId>
 <artifactId>lucene-analyzers</artifactId>
 <version>3.6.0</version>
</dependency>

Then we have to configure this analyser in the Lucene configuration XML, which in Magnolia is located in:

WEB-INF/config/default/repositories.xml

The section you need to change is:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex"> 

with the param:

<param name="analyzer" value="org.apache.lucene.analysis.es.SpanishAnalyzer"/> 

The whole section will look like this:

    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <!-- SearchIndex will get the indexing configuration from the classpath, if not found in the workspace home -->
      <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
      <param name="analyzer" value="org.apache.lucene.analysis.es.SpanishAnalyzer"/>
      <param name="useCompoundFile" value="true" />
      <param name="minMergeDocs" value="100" />
      <param name="volatileIdleTime" value="3" />
      <param name="maxMergeDocs" value="100000" />
      <param name="mergeFactor" value="10" />
      <param name="maxFieldLength" value="10000" />
      <param name="bufferSize" value="10" />
      <param name="cacheSize" value="1000" />
      <param name="forceConsistencyCheck" value="false" />
      <param name="autoRepair" value="true" />
      <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl" />
      <param name="respectDocumentOrder" value="true" />
      <param name="resultFetchSize" value="100" />
      <param name="extractorPoolSize" value="3" />
      <param name="extractorTimeout" value="100" />
      <param name="extractorBackLogSize" value="100" />

    </SearchIndex>

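You need to redeploy your webapp so that the indexes are recreated with the new analyser. Content indexed with the previous analyser will not match accent-free queries until it has been re-indexed.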
