Building simplicidade.org: notes, projects, and occasional rants

utf-8 and DBD::mysql

After an afternoon trying to understand why some of my output from a utf8 table in MySQL was coming out garbled, I finally realize that:

  • even if your tables and database are all created with utf8 charset;
  • even if you set your connection charset to utf8 with SET NAMES 'utf8';

your scalar results in perl will not have the utf8 flag set, so any print, concatenation or XML generation further on will result in a mess, when finally printed out to a XMPP stream, for example.

So, on all your code, after you retrieve data from MySQL, you must set the utf8 flag on that scalar.

For now I'm using this code. Probably not the best one, but it suffices for now.

if (! utf8::is_utf8($message)) {
  utf8::decode($message);
}

There is some discussion about this online. It seems that the DBD::mysql people are waiting for a general solution for the problem to appear in a future version of DBI. There is also a patch floating around that sets the flag on utf8 content.

If you use Class::DBI, you can also look at Class::DBI::utf8 that does the right thing.

Regarding support for this in DBI itself, there is a thread by Tim Bunce that talks about utf8 support in DBI in a future version, in particular bullet 4 of the initial post. But the next bullet points the responsibility of the utf8 flag to the drivers.

This quote in particular should self explanatory about Tim's reasoning:

Some features, like charsets, vary greatly in how they're handled by database APIs. For these kind of features the DBI usually lags the drivers. Once a few drivers have implemented their own driver-specific interfaces, and had them proven as practical by users, then I can work with driver authors to see how best to extend the DBI API in a way that'll work well for those drivers and others.

And a more specific one regarding DBD::mysql:

Basically it should be the job of the drivers to set the uft8 flag on data being retrieved if it is utf8. I believe that the new mysql v4.1 protocol does provide information about the characterset of each colum. DBD::mysql can use that.

I would like to see that patch into the DBD::mysql mainline. It seems that Tim Bunce is passing on the responsibility of the utf8 flag to the driver author. It makes some sense. If the DBI layer was responsible to set the flag, it would need to obtain charset information from the DBD driver anyway. In that case, if the driver already knows which charset it is using, why not just set the flag? This would make it easier to work with utf8 in the meantime...

Stay tuned for the next chapters in the utf8+DBD::mysql saga...

Update: another interesting link about MySQL, utf8, and Moveable Type.

Technorati Tags: , , ,