Generating charset_table maps for Sphinx

At work, I wanted to improve the back-office search engine and installed Sphinx.

On a regular basis, it indexes all our users, courses, teachers, and other important tables. It works great: low install barrier, low maintenance, and it is very, very fast. Perfect.

One of the problems we found that limited the usefulness of the full-text search engine is that a lot of our text has accents, and it would be better to ignore those. We also don't need case sensitivity.

So I needed to generate a charset_table map, which is what Sphinx uses to normalize the text that you give it to index.
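
For context, a charset_table is just a comma-separated list of mapping rules in sphinx.conf. A fragment covering a few of the Portuguese accented characters (illustrative, not the full map) looks something like this:

    charset_table = 0..9, A..Z->a..z, _, a..z, \
        U+C1->a, U+E1->a, U+C3->a, U+E3->a, \
        U+C7->c, U+E7->c, U+C9->e, U+E9->e

Each rule either declares characters as valid (0..9, a..z) or folds one character into another (U+E1->a maps 'á' to plain 'a'); anything not listed is treated as a separator.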

And being a (very) lazy person, I preferred to write a Perl script to do it. The result is x-sphinx-charset-generator, now part of my script stash.

It takes an optional parameter, the charset that you are using in your text (defaulting to 'utf8', the loose version of UTF-8), and generates a charset_table for the most common accented characters, mapping each one to the lower-case version of the same letter without the accent.
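
If you're curious about the approach, here is a minimal sketch of the core idea (not the actual script): decompose each accented character with Unicode::Normalize, strip the combining marks, and emit a folding rule for it. The character list here is just an illustration:

    #!/usr/bin/perl
    use utf8;
    use strict;
    use warnings;
    use Unicode::Normalize qw( NFD );

    # Illustrative list: the common Portuguese accented characters
    my @accented = split //, 'áàâãçéêíóôõúÁÀÂÃÇÉÊÍÓÔÕÚ';

    my @rules = ( '0..9', 'A..Z->a..z', '_', 'a..z' );
    for my $char (@accented) {
        my $base = lc NFD($char);  # decompose: 'á' becomes 'a' + combining acute
        $base =~ s/\p{Mn}//g;      # drop the combining marks, keep the base letter
        push @rules, sprintf 'U+%X->%s', ord($char), $base;
    }

    print 'charset_table = ', join(', ', @rules), "\n";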

I've only included the common Portuguese characters. Patches are accepted for other characters that you might need.

The only part that I don't really like is that I need to apply the same logic to clean up the strings that users submit as searches. I would prefer to have a module that takes the characters I want to allow as valid and provides both the charset_table and a function to clean search inputs. Interesting, but for now this solves the important 80% of the problem.
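
In the meantime, the query-side clean-up is the same decompose-and-strip step. A rough sketch (clean_search_input is a hypothetical name, assuming the same character handling as above):

    use utf8;
    use strict;
    use warnings;
    use Unicode::Normalize qw( NFD );

    # Normalize a search string the same way the charset_table does:
    # lower-case it, decompose the accents, drop the combining marks.
    sub clean_search_input {
        my ($query) = @_;
        my $clean = NFD( lc $query );
        $clean =~ s/\p{Mn}//g;
        return $clean;
    }

    print clean_search_input('Introdução à Programação'), "\n";
    # prints: introducao a programacao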