
September 26, 2009

Generating charset_table maps for Sphinx

At work, I wanted to improve the back-office search engine and installed Sphinx.

On a regular basis, it indexes all our users, courses, teachers and other important tables. It works great: low install barrier, low maintenance, and very, very fast. Perfect.

One problem we found, which limited the usefulness of the full-text search engine, is that a lot of our text has accents, and it would be better to ignore those. We also don't need case sensitivity.

So I needed to generate a charset_table map, which is what Sphinx uses to normalize the text that you give it to index.

And being a (very) lazy person, I preferred to write a Perl script to do it. The result is x-sphinx-charset-generator, now part of my script stash.

It takes one optional parameter, the charset of your text (defaulting to 'utf8', the loose version of UTF-8), and generates a charset_table for the most common accented characters, mapping each to the lower-case version of the same letter without the accent.

I've only included the common Portuguese characters. Patches accepted for other characters that you might need.

The only part that I don't really like is that I need to apply the same logic to clean up the strings that users type when searching. I would prefer to have a module that would take the characters that I want to allow as valid, and have that module provide both the charset_table and a function to clean search inputs. Interesting, but for now this will solve the important 80% of the problem.
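To make the idea concrete, here is a minimal sketch of how such a generator could work. It is not the actual x-sphinx-charset-generator script: the character list, the base entries, and the function names normalize() and charset_table() are assumptions made for illustration. It strips accents by decomposing each character with the core Unicode::Normalize module and dropping the combining marks.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD );

# Illustrative subset of Portuguese accented letters (an assumption, not the
# real script's list), given as code points to avoid source-encoding issues.
my @accented = map { chr } (
    0xC0, 0xC1, 0xC2, 0xC3, 0xC7, 0xC9, 0xCA, 0xCD, 0xD3, 0xD4, 0xD5, 0xDA,
    0xE0, 0xE1, 0xE2, 0xE3, 0xE7, 0xE9, 0xEA, 0xED, 0xF3, 0xF4, 0xF5, 0xFA,
);

# Decompose, drop the combining marks, and lower-case what is left.
# The same function could be applied to search input.
sub normalize {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return lc $decomposed;
}

# Build the value of a charset_table directive: a few base entries
# (hypothetical defaults) plus one "U+XXXX->letter" mapping per character.
sub charset_table {
    my @chars   = @_;
    my @entries = ('0..9', 'A..Z->a..z', '_', 'a..z');
    push @entries, sprintf('U+%04X->%s', ord($_), normalize($_)) for @chars;
    return join(', ', @entries);
}

print 'charset_table = ', charset_table(@accented), "\n";
```

Because normalize() works on whole strings, the same function could also clean up search input, which is the "one module for both jobs" idea mentioned above.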

September 24, 2009

Tip: use pv to monitor mysql loads

If you are fortunate enough to be able to reload your development databases from time to time with production data, this might help.

The usual command you would use is something like this:

gunzip -c db-backup.sql.gz | mysql -udevuser -ppass db_dev

If your dump is big, this can take a while and you won't have a clue about what is happening.

Instead, install pipe viewer and do:

gunzip -c db-backup.sql.gz | pv | mysql -udevuser -ppass db_dev

and you get a nice speed meter. For even better results:

dump=db-backup.sql.gz
size=`gunzip --list $dump | perl -ne 'print "$1\n" if /\d+\s+(\d+)\s+\d/'`
gunzip -c $dump | pv -s $size | mysql -udevuser -ppass db_dev

and you'll get a speed meter and an ETA. The second command will get the uncompressed size of the dump and use that to teach pv how much data to expect.

Pipe viewer rockz.

September 19, 2009

A new look at Mason

On my way to E5, I have to deal with all the legacy sites that came before it, and the vast majority of them are written in Mason.

At the time, we used HTML::Mason 1.05, and only after 1.30-ish (when the internal buffering changes introduced in the 1.10 release were reverted) did we upgrade to something more recent. To get an idea of the size of this Mason project, the current sites have a little over 450 different components across 7 sites (different layouts but same content) and 1 management site. A lot of those components are no longer in use, and were left there due to bad VCS practices. I would estimate about 200 actually useful components. And it makes heavy use of autohandlers, dhandlers, and multiple component roots to implement site inheritance.

The E5 template discussion is also still on my mind, without a clear winner yet.

And finally, last week I read Jonathan Swartz's article about what Mason 2.0 would look like.

So I took the time last night to re-evaluate Mason. As with most powerful tools (and be sure that it is a very powerful tool), the problem with Mason is that you can easily make a big mess of things. More than other solutions like Catalyst and Mojo, which provide you with a clear separation of Controller, Model and View, Mason makes it very easy to mix the three. You could end up with a lot of logic that should be in your models inside your templates.

But it has several advantages:

  • it's easy to start and add a new page. Just create the file and start typing: no need to jump between controller and template, and no need for a restart (this alone makes for speedy development);
  • it has decent wrapper functionality for skinning: autohandlers are great;
  • the multiple component roots logic is very powerful, and it's used both during the dispatch phase and component calls;
  • the view logic is Perl: no need to learn a new language and be exasperated by its limitations, as with TT.

There are several downsides of course: for one, the split of Controller/View of modern frameworks allows you to reuse controller logic with multiple views. For example, you could output HTML, JSON and XML with the same controller code.

But the biggest downside is this: deployment is a bitch.

For production environments, deployment usually means mod_perl, but I find FastCGI easier to deploy nowadays. Yet this option is only briefly mentioned on the MasonHQ site.

So I created a small experimental project (you can find all the files at the exp-mason-fcgi project on GitHub). It has a FastCGI startup script to power two virtual hosts. Each one shares a master component root, and has a local per-site component root to override the master site behavior when needed.

The setup works just fine under nginx+FastCGI (partial nginx.conf included), but I did get into some trouble. Mason usually delegates some stuff to Apache and without his big daddy around, he can get lost.

The first problem is directory index files. When you request a directory, Apache will help Mason out and point it to the proper index.html file. Without Apache, a request to http://your-fastcgi-mason-site/ will just fail, because Mason cannot find the component for /. It has no logic to map / into /index.html, for example.

You could implement this with a global dhandler, and it is probably the best solution, because it can also deal with 404 situations (another one that Apache could clean up after).

I have a proof-of-concept hack in the repo that mimics the Apache DirectoryIndex directive. It is a hack: it should be in Interp.pm and not in the Request. I'll clean it up later. But it does work, and it might be useful in some scenarios.

This patch makes a / request work just fine.
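The mapping itself is simple enough to sketch in plain Perl. This is not the patch from the repo: resolve_index() and the %components lookup are hypothetical stand-ins, and in a real setup the existence check would ask Mason whether a component exists for the candidate path.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Mason: map a request path to a
# component path, trying index names in order, DirectoryIndex-style.
sub resolve_index {
    my ($path, $exists, @index_names) = @_;

    # Only directory-style requests (ending in /) need an index lookup.
    return $path unless $path =~ m{/\z};

    for my $name (@index_names) {
        my $candidate = $path . $name;
        return $candidate if $exists->($candidate);
    }

    return;    # nothing matched: the caller can answer with a 404
}

# Stand-in for the component root: a plain hash of known component paths.
my %components = map { $_ => 1 } qw( /index.html /docs/index.mhtml );
my $exists     = sub { $components{ $_[0] } };

print resolve_index('/',      $exists, qw( index.html index.mhtml )), "\n";    # /index.html
print resolve_index('/docs/', $exists, qw( index.html index.mhtml )), "\n";    # /docs/index.mhtml
```

Returning nothing when no index file matches leaves room for the 404 handling that a global dhandler would otherwise provide.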

The second problem I have with Mason is the order of evaluation of templates. The current order for a request to /index.html is /autohandler which calls $m->call_next and that calls the /index.html component.

This makes it hard to influence the wrapper with content generated by the /index.html component.

The solution was to create a new HTML::Mason::Request method called scall_next(). It merges the $m->call_next() and the $m->scomp() calls into one, so an autohandler can call the next component, capture the generated HTML, and only then generate the HTML wrapper.

The last piece of the puzzle is a way for the /index.html and the parent autohandler to communicate, for example, to pass along the title for the page.

The Mason-recommended way is to use the $m->notes() API, similar to the Catalyst stash concept. It works very well, but I prefer to take advantage of the fact that all components live inside the HTML::Mason::Commands namespace and just declare a shared %stash there and clean it up per request. With this, it's easy to implement dynamic page titles.
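The combination of the two ideas (render the inner component first, then let the wrapper read what it left in the stash) can be modelled in plain Perl. This is only a model of the evaluation order, not Mason code: page_component(), wrapper_component() and handle_request() are hypothetical stand-ins for /index.html, /autohandler and the request dispatch.

```perl
use strict;
use warnings;

our %stash;    # shared scratch space, reset for every request

# Stand-in for /index.html: sets the page title and returns its body.
sub page_component {
    $stash{title} = 'Welcome';
    return '<p>Hello</p>';
}

# Stand-in for /autohandler: runs the inner component first and captures
# its output (the scall_next idea), then builds the wrapper around it.
sub wrapper_component {
    my ($next) = @_;
    my $body  = $next->();
    my $title = defined $stash{title} ? $stash{title} : 'Untitled';
    return "<html><head><title>$title</title></head><body>$body</body></html>";
}

# Stand-in for the request dispatch: local() gives each request a clean stash.
sub handle_request {
    local %stash = ();
    return wrapper_component(\&page_component);
}

print handle_request(), "\n";
# <html><head><title>Welcome</title></head><body><p>Hello</p></body></html>
```

Because the body is captured before the wrapper is produced, the title set by the inner component is already available when the head of the page is written.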


All in all, I keep coming back to Mason. I do like most of what it provides, and for quick sites, it beats all the other alternatives in Perl-land.

A couple of months ago there was some discussion about an option to give web developers that was FTP friendly, in that you could upload your pages to a server and it would just work. Mason is the closest that Perl has to that goal.

But the deployment must be made simpler, and that is one of the things that Mason 2.0 should focus on: make it simpler to deploy.

So my laundry list for Mason 2.0 (and most of them can be implemented on 1.x):

  • FastCGI support out-of-the-box;
  • support for directory index files (without using dhandlers);
  • $m->scall_next;
  • better hooks for debugging and 404 errors.

I don't know. I'm strongly considering going Mason all the way for E5, and if that goes forward, I guess I'll have to write these four pieces myself.

September 16, 2009

Bitten by prototypes

I just spent the best part of an hour around a problem caused by the behavior of Perl prototypes.

I used the following test case to figure it out:

use Test::More tests => 1;
use Encode qw( encode decode );

sub u8l1 {
  return encode('iso-8859-1', @_);
}

my $ola_u8 = decode('utf8', 'Olá');
my $ola_l1 = encode('iso-8859-1', $ola_u8);
is(u8l1($ola_u8), $ola_l1);

The output of prove x.t is this:

t/x.t .. 1/1 
#   Failed test at t/x.t line 12.
#          got: '1'
#     expected: 'Ol?'
# Looks like you failed 1 test of 1.
t/x.t .. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests

The got: '1' had me stumped for quite some time. Until I changed the u8l1() helper to this:

sub u8l1 {
  return encode('iso-8859-1', $_[0]);
}

And it just works.

The problem is the definition of the Encode::encode() function. It has a prototype like this:

sub encode($$;$)

So our @_ is interpreted in scalar context, and so evaluates to the number of parameters, 1.

I don't like it at all because it changes the standard Perl behavior of expanding lists. It's action at a distance. The fact that you cannot pass a single-element list is also not mentioned in the documentation.
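The effect is easy to reproduce without Encode. In this small sketch, proto_second() is a hypothetical stand-in that has the same ($$;$) prototype quoted above and simply returns its second argument:

```perl
use strict;
use warnings;

# Hypothetical stand-in with the same prototype as the one quoted above.
sub proto_second ($$;$) { return $_[1] }

# Same body, no prototype.
sub plain_second { return $_[1] }

# @_ lands in a "$" slot, so it is evaluated in scalar context: its count.
sub via_array { return proto_second('x', @_) }

# An explicit element is passed through untouched.
sub via_element { return proto_second('x', $_[0]) }

# Calling with a leading & makes Perl ignore the prototype.
sub via_ampersand { return &proto_second('x', @_) }

# Without a prototype, the list is flattened as usual.
sub via_plain { return plain_second('x', @_) }

print via_array('Ola'),     "\n";    # 1
print via_element('Ola'),   "\n";    # Ola
print via_ampersand('Ola'), "\n";    # Ola
print via_plain('Ola'),     "\n";    # Ola
```

Only the first wrapper is affected, which matches the got: '1' in the failing test.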

The only really useful use of Perl prototypes is using a & as the initial character, which allows you to write a function that looks like built-ins such as sort or map, which take an anonymous sub as the first parameter.
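As a quick illustration of that use, here is a sketch of a map-like function. The name apply_pairs() and its behavior are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical map-like helper: the leading & in the prototype lets callers
# pass a bare block, without the "sub" keyword or a comma after it.
sub apply_pairs (&@) {
    my ($code, @list) = @_;
    my @out;
    while (@list >= 2) {
        my ($key, $value) = splice(@list, 0, 2);
        push @out, $code->($key, $value);
    }
    return @out;
}

my @joined = apply_pairs { "$_[0]=$_[1]" } a => 1, b => 2;
print "@joined\n";    # a=1 b=2
```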

September 12, 2009

Stupidity

Just finished watching the first, and only, season of Firefly.

I have only one question: who was the monster stupid that cancelled this show? What was he thinking?

September 10, 2009

Log::Log4perl tip

I use Log::Log4perl for all my logging needs. Ok, I lie. I use a wrapper that deals with some stuff that I just don't like with Log::Log4perl, but that is a story for another day.

One thing that we inherited from log4j was the notion that a message can match multiple loggers in your logging hierarchy.

The logic is simple and explained in detail on a Log::Log4perl FAQ entry. If you write something like this in your logger configuration file:

log4perl.logger.Cat        = ERROR, Screen
log4perl.logger.Cat.Subcat = WARN, Screen

which defines two loggers, Cat and Cat.Subcat (the second a subcategory of the first), and then use:

my $logger = get_logger("Cat.Subcat");
$logger->warn("Warning!");

you'll get a duplicate message in your log file because it matches both loggers.

I knew that and I always added a line saying:

log4perl.additivity.Cat.Subcat = 0

that prevented this behavior, but this required a line like that per logger. Pain. Not lazy, at all.

But for some reason (stupidity comes to mind) I didn't read the FAQ completely, because at the end, there is a solution. Just put this in your logger configuration file:

log4perl.oneMessagePerAppender = 1

Bliss, pure bliss.

Mind you that oneMessagePerAppender is not compatible with log4j, something that Log::Log4perl tries very hard to be, and therefore this feature is not documented at all except on this FAQ entry.

Problem solved

I keep a pad of paper between me and my keyboard at all times. I use it to take down notes, keep track of what I need to do today, small brain dumps, and random scribbles.

But with my hands going about their business, the corner of the paper starts to bend upwards.

Problem

The solution is not rocket science. A simple paper clip.

Solved

Problem solved.

September 06, 2009

Last chance to see

The book turns into a TV show.

It might not have a comparison between riding a manta ray and riding an air-powered submersible thingie, but it shouldn't be too bad, given the two names associated with it.

Bootstrap Perl

Whenever a new version of Perl is released, I install it in a separate directory and re-install all my modules into a new local::lib-powered directory.

This takes a lot of time, but I already had most of the process on autopilot.

But still it was a hack, so I decided to take the opportunity of the 5.10.1 release and make something prettier and more reliable.

The result is my Perl bootstrap repo.

There are two scripts. The first, bootstrap.sh, will install the local::lib module and prepare the environment. It's still not finished (it doesn't alter the .bashrc file), but it will get there.

The second, install_deps.sh, will use the cpan shell to install a local Task::Bootstrap module. This Task has all the modules that I want installed.

There are still some problems. I still lack some distro prefs for a couple of them that pause the process and ask for user input. And some of the modules won't install without force (Mac::Carbon is the one that fails the most).

Other modules just don't install correctly on Mac OS X. Danga::Socket, for example, requires Sys::Syscall, but this one fails the tests because Mac OS X lies about sendfile support: the sys/syscalls.ph includes the SYS_sendfile constant, but when you actually call it, you get a "Function not implemented" error. I'm sure I could work around it, and probably fix it, but I no longer use Danga::Socket so I'll probably just remove that dependency.

The other was Mac::AppleEvents::Simple. Finder.app has a different naming scheme for its windows, and t/simple.t was failing. I've sent a patch to the module's RT queue.

I still have small failures, but right now, I can mostly use these two scripts to set up a Perl environment from bare metal.

Update: I removed my ~/.perl5/5.10.1/ directory and ran time ./bootstrap.sh. The results:

real 56m4.598s
user 37m17.724s
sys  7m51.923s

So about an hour on a MacBook Pro 2.16GHz Core Duo, running Leo.

September 04, 2009

SAPO turns 14

So 14 years ago a project was born in Aveiro. Just a couple of guys (there is a picture of them, but it was buried somewhere due to hair style issues), love for technology and an idea.

Today the company has over 200 people working there, keeps sharing its technology with us all, gives us access to a bunch of very cool APIs, and invites us to have fun from time to time.

I was lucky enough to work there for a couple of years, and I do hope to do it again sometime in the future; it was a lot of fun.

For now, a really big happy birthday to SAPO.

Contacts

melo@simplicidade.org (XMPP/email)
+351 302 029 050 (voice)
melopt (Skype)
