Character Folding with Haystack Search

Character folding is the technique of storing accented characters in a search index as their ASCII equivalents, where they exist. This means that, for example, "café" and "māori" are indexed as "cafe" and "maori" respectively, so a search for either the accented or the unaccented form turns up the same results.
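To make the idea concrete, here is a minimal sketch of folding in plain Python. It is not what any of the backends below actually run: it assumes that decomposing each character and dropping the combining marks is a good enough approximation, and the function name fold is just an illustration.

```python
import unicodedata


def fold(text):
    """Approximate ASCII folding: decompose characters, then drop accents."""
    # NFKD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks (the accents themselves).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ("café", "māori"):
    print(word, "->", fold(word))
```

Characters with no decomposition, such as "ø", pass through this sketch unchanged, which is one reason the search backends rely on fuller mapping tables rather than this trick.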

Implementing this technique with haystack is fairly simple, though it varies with the backend used. I'm using haystack 2.x, whoosh 2.x, and Ubuntu 10.04's bundled solr/jetty installation in these examples.

Solr

For solr, all you need to do is modify the xml schema template that haystack uses to generate the schema. To do this, copy the haystack/templates/search_configuration directory into your main template directory, and open the solr.xml file it contains. Find the fieldType node with the name "text", which should contain two analyzer nodes. To each of these, add the line <filter class="solr.ASCIIFoldingFilterFactory"/>. Your complete node should look something like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
    ...
    <filter class="solr.ASCIIFoldingFilterFactory"></filter>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
    ...
    <filter class="solr.ASCIIFoldingFilterFactory"></filter>
  </analyzer>
</fieldType>

Then generate a new solr schema using

./manage.py build_solr_schema

Once you've copied the new schema into your solr conf, and reindexed your content, you should be able to search for "café" and have results containing "cafe" show up, and vice versa.
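If you'd rather script the change to the "text" field type than edit it by hand, something like the following sketch would do it on a plain xml fragment. The helper name add_folding_filter and the cut-down fragment are made up for illustration; haystack's real solr.xml is a full schema containing Django template tags, so this would need adapting before being pointed at that file.

```python
import xml.etree.ElementTree as ET

FOLDING_FILTER = "solr.ASCIIFoldingFilterFactory"


def add_folding_filter(field_type):
    """Append the folding filter to every analyzer that doesn't have it yet."""
    for analyzer in field_type.findall("analyzer"):
        classes = [f.get("class") for f in analyzer.findall("filter")]
        if FOLDING_FILTER not in classes:
            # Added last, so folding runs after the analyzer's other filters.
            ET.SubElement(analyzer, "filter", {"class": FOLDING_FILTER})
    return field_type


# Hypothetical, cut-down version of the "text" field type.
fragment = """
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
"""

field_type = add_folding_filter(ET.fromstring(fragment.strip()))
print(ET.tostring(field_type, encoding="unicode"))
```

The check against existing filter classes means running it twice won't add the filter a second time.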

Whoosh

For whoosh, I've written a character folding backend for haystack. It is simply a subclass of haystack's built-in whoosh backend that adds a CharsetFilter to the analyzer, as detailed in the whoosh docs.

Download the backend from https://gist.github.com/1727204, and add the search_backends.py file to your Python path. Then all you need to do is set it as the engine in your settings.py, e.g.

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search_backends.FoldingWhooshEngine',
        'PATH': 'path-to-whoosh-index',
    },
}

Then reindex your content, and you're away.

Elasticsearch

This works exactly like it does for Whoosh, with an elasticsearch-specific backend. See "Enable Asciifolding in Elasticsearch/Haystack" by Mounir Messelmeni for details.

