Character Folding with Haystack and ElasticSearch, Whoosh or Solr

2016-02-01

Character folding is the technique of storing accented characters in a search index as their ASCII equivalents, where one exists. For example, "café" and "māori" are indexed as "cafe" and "maori" respectively, so a search for either the accented or the unaccented form returns the same results.
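To make the idea concrete, here is a minimal sketch in plain Python using only the standard library. It illustrates folding in general and is not what any of the backends below actually run: the function name fold is made up for this example, and real search engines use their own mapping tables, which also cover characters such as "ø" or "ß" that this decomposition approach leaves untouched.

```python
import unicodedata


def fold(text):
    """Return text with accents stripped from letters that have a base form."""
    # NFKD splits an accented letter into its base letter followed by
    # one or more combining marks, e.g. "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Both spellings fold to the same term, so either one matches the other.
print(fold("café") == fold("cafe"))
print(fold("māori"))
```

If the same folding is applied both when indexing and when querying, the accented and unaccented spellings end up as the same term.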

This is very useful in English, where accents are effectively optional: search results need to be consistent whether or not the user knows the correct, accented spelling.

Implementing this technique with haystack is fairly simple, though it varies with the backend used. I'm using haystack 2.x, whoosh 2.x, and Ubuntu 10.04's bundled solr/jetty installation in these examples.

Whoosh

For whoosh, I've written a character-folding backend for haystack. It is simply a subclass of haystack's built-in Whoosh backend that adds a CharsetFilter, as detailed in the whoosh docs.

Download the backend from https://gist.github.com/gregplaysguitar/1727204, and add folding_whoosh_backend.py to your python path. Then all you need to do is add it to your settings.py:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'folding_whoosh_backend.FoldingWhooshEngine',
        'PATH': 'path-to-whoosh-index',
    },
}

Then reindex your content, and you're away.

./manage.py rebuild_index

Elasticsearch

This works exactly like it does for Whoosh, with an elasticsearch-specific backend. See Enable Asciifolding in Elasticsearch/Haystack by Mounir Messelmeni for details.
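On the Elasticsearch side, the folding is done by the built-in asciifolding token filter. As a rough sketch, an index's analysis settings with folding enabled might look like the dictionary below. The analyzer name folding_analyzer and the variable name FOLDING_SETTINGS are placeholders of mine, and how the settings are passed to haystack's backend is covered in the linked article rather than here.

```python
# Hypothetical analysis settings for an Elasticsearch index.
# "folding_analyzer" is a made-up name; "standard", "lowercase" and
# "asciifolding" are built-in Elasticsearch components.
FOLDING_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folding_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}
```

As with Whoosh, the content needs to be reindexed after the analyzer changes so that the stored terms are folded too.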

Solr

I haven't used solr since I discovered ElasticSearch, but working with it should be pretty straightforward. You'll need to modify the XML schema that haystack generates. To do this, copy the haystack/templates/search_configuration directory into your main template directory, and open the solr.xml file it contains. Find the fieldType node with the name "text", which should contain two analyzer nodes. To each of these, add the line <filter class="solr.ASCIIFoldingFilterFactory"/>. Your complete node should look something like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    ...
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    ...
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Then generate a new solr schema using

./manage.py build_solr_schema

Once you've copied the new schema into your solr conf, and reindexed your content, you should be able to search for "café" and have results containing "cafe" show up, and vice versa.