
Just another data point

(Update: I've pushed my code, including three new scripts, to the nfsd_report_bench/ directory on my examples repository. See below for some clarifications based on comments I received).

A former colleague of mine at PT had a small reporting problem, and he ended up comparing several languages for the job: C, Perl, PHP, and Python.

I was curious about the results, so I took the latest version of the Perl script that he was using and set off to work.

The first thing you should be aware of is where your bottleneck is. Take a look at this small script:

#!/usr/bin/env perl

use strict;
use warnings;

my $lines;
while (<STDIN>) {
#  my @fields = split / /;
  $lines++;
}
print "$lines\n";

A basic line counter. Compare it to the system wc:

$ gzcat nfsd.gz | time wc -l
 12236390
       16.86 real        16.24 user         0.43 sys
$ gzcat nfsd.gz | time ./wc_simple.pl 
12236390
       10.13 real         8.51 user         0.84 sys

So, the Perl version is a bit faster than the C one (the system wc), doing about 1.2M lines per second on my laptop.

But if you uncomment the split(), we have:

$ gzcat nfsd.gz | time ./wc_simple_with_split.pl
12236390
       228.78 real       224.39 user         2.34 sys

A lot less: 53k lines per second.

So the first bottleneck is the split(). Let's improve on that. After some attempts I came up with this:

my $lines;
while (<STDIN>) {
  my ($ts)    = /^(\d+)/gc;
  my ($type)  = /(\w)\sV/gc;
  my ($op)    = /\s(\d\d?)\s\w/gc;
  my ($bytes) = /\w\s(\d+)\s/gc;
  $lines++;
}

I make use of the /gc flags to start the next match where the previous one ended. I also take advantage of fixed patterns in the lines I need to match, like the V in the NFS version field.
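
Here is a toy example of what /gc buys us (the sample line is made up, not the real nfsd format; the real script uses list-context matches, but the idea is the same):

#!/usr/bin/env perl
# Toy illustration of /gc: each match starts at pos(), where the previous
# successful match ended, and /c keeps pos() in place when a match fails.
use strict;
use warnings;

$_ = "1000 C V3 7 4096";
if (/^(\d+)/gc)      { print "ts=$1 pos=",   pos, "\n" }  # "1000", pos() just past it
if (/(\w)\sV/gc)     { print "type=$1 pos=", pos, "\n" }  # "C", search resumed at pos()
if (/\s(\d\d?)\s/gc) { print "op=$1 pos=",   pos, "\n" }  # "7", still moving forward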

With the regexp version we get:

$ gzcat nfsd.gz | time ./wc_simple_with_regexps.pl
12236390
      109.62 real       107.42 user         1.55 sys

A bit better: 111k lines per second, a bit over 2x the previous result.

If we apply this gain to the reported time for the Perl script (419m11.267s), we get 200m49.97s, which would place Perl first, ahead of the C version by a minute or so.

My adjusted version of the stats_basic_optimized.pl script is melo_stats_basic.pl.

This was good enough, I think, but I wanted to study the I/O gains that could be made. Stop reading here if you are already bored; it only gets worse. We did manage to get a few more minutes back, though... ;)

So I went after the I/O performance. First I wanted to rule out the pipe as the bottleneck.

$ cat /dev/zero | pv | cat > /dev/null
4.61GB 0:00:10 [ 498MB/s]

$ gzcat nfsd.gz | pv | cat > /dev/null 
1GB 0:00:04 [ 242MB/s]

So the pipe is not the bottleneck, but we will never reach full speed: gzcat will be our limitation.

I did try reading the gzip file directly in the Perl script and uncompressing it there, but it was very slow.
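
(For reference, here is a sketch of one way to do the in-process decompression, using IO::Uncompress::Gunzip, bundled with perl 5.10 and on CPAN for older perls; not necessarily the exact code I used.)

#!/usr/bin/env perl
# Sketch of in-process decompression with IO::Uncompress::Gunzip,
# instead of piping in the output of an external gzcat.
use strict;
use warnings;
use IO::Uncompress::Gunzip qw( $GunzipError );

my $z = IO::Uncompress::Gunzip->new('nfsd.gz')
  or die "gunzip failed: $GunzipError\n";

my $lines = 0;
while (defined(my $line = $z->getline)) {
  $lines++;
}
print "$lines\n";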

So assuming a limit of 242MB/s on the input side, how fast are we chopping lines? The size of the input is 1073743224 bytes, and our simple wc_simple.pl (no split, no regexp) took 10.13 seconds above, so we are chopping the input at a rate of 101MB/s.
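
That 101MB/s figure is just the input size divided by the wall-clock time:

$ perl -e 'printf "%.0f MB/s\n", 1073743224 / 10.13 / 2**20'
101 MB/s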

So there is room to grow there (101MB/s to 242MB/s). I did some experiments:

$ gunzip nfsd.gz
$ time wc_simple.pl < nfsd
12236390

real    1m23.212s
user    0m7.434s
sys 0m2.166s

Yeah, the uncompressed file doesn't fit in the filesystem cache like the gzipped version does, so there is real I/O, and the times go through the roof.

There is no point doing experiments with the big nfsd file. All of them will result in real I/O, and that is always slower than memory.

Let's try to do bigger reads and parse the results:

use strict;
use warnings;

my $size = shift || 2 ** 20; ## 1MB default read size
my $offset = 0;
my $buf = '';
my $lines = 0;

while (1) {
  ## append the next block to whatever is left in the buffer
  my $n = sysread(\*STDIN, $buf, $size, length($buf));

  ## count the complete lines sitting in the buffer
  while ($buf =~ /.+\n/gc) {
    $lines++;
  }
  last unless $n > 0;

  print "$lines $n\n" unless $lines & 0x1ffff;
  ## keep only the trailing partial line for the next read
  $buf = substr($buf, pos($buf));
}
print "$lines\n";

I tried several block sizes, but on my OS the most I could read in one call was 64k. So, asking for 1MB and getting 64k reads, we get:

$ gzcat nfsd.gz | time  ./wc_batch.pl
12236390
    8.37 real         7.52 user         0.58 sys

We get 1.46M lines per second, a 17% improvement. Let's adjust it to retrieve our fields. The inner loop becomes:

while ($buf =~ /(.+)\n/gc) {
  $_ = $1;
  my ($ts)    = /^(\d+)/gco;
  my ($type)  = /(\w)\sV/gco;
  my ($op)    = /\s(\d\d?)\s\w/gco;
  my ($bytes) = /\w\s(\d+)\s/gco;

  $lines++;
}

and the runtime:

$ gzcat nfsd.gz | time  ./wc_batch_with_regexps.pl
12236390
   95.58 real        93.21 user         1.14 sys

So from 109.62s to 95.58s, 12% better (or, compared with our baseline wc_simple_with_split.pl at 228.78s, 58% better). Adjusting this to the reported results, we would go from 419m11.267s down to 175m5.6842s.

I don't think I can improve on this unless we can get bigger pipe reads. For example, forcing the reads to 8k:

$ gzcat nfsd.gz | time  ./wc_batch.pl 8192
12236390
       12.08 real        10.93 user         0.70 sys

A lot worse than the 8.37s we got with 64k reads. So the size of the pipe buffer is the next factor we could explore, if that is even an option with your kernel.
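
For what it's worth, recent Linux kernels (2.6.35 and later) can resize a pipe buffer with fcntl() and F_SETPIPE_SZ; that is not available on the Mac OS X box I used for these numbers, so the sketch below is untested and hard-codes the Linux constant (1031), since the Fcntl module may not export it:

#!/usr/bin/env perl
# Untested sketch: ask the kernel for a bigger pipe buffer on STDIN.
# F_SETPIPE_SZ (1031) is Linux-only; elsewhere the fcntl() simply fails.
use strict;
use warnings;

my $F_SETPIPE_SZ = 1031;       # Linux-specific fcntl constant
my $wanted       = 2 ** 20;    # ask for a 1MB pipe buffer

my $got = fcntl(STDIN, $F_SETPIPE_SZ, $wanted);
if ($got) { print "pipe buffer is now $got bytes\n" }
else      { warn "could not resize pipe: $!\n" }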

But I'm happy now.

The next day

Or so I thought. First, there were doubts about whether my regexps really were faster than split(). I wrote bench_splitters.pl (output on my laptop, download link) to compare split with my regexps. The regexps are a bit over twice as fast, but I found big differences between the Mac OS X system perl (5.8.8 on Leopard) and the 5.10.1 that I compiled: the system perl was between 20 and 30% faster.
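
A comparison like that boils down to a plain Benchmark run; a rough sketch (not the actual bench_splitters.pl, and with a made-up sample line):

#!/usr/bin/env perl
# Rough sketch of the split-vs-regexps comparison, using the core Benchmark module.
use strict;
use warnings;
use Benchmark qw( cmpthese );

my $line = "1000 C V3 7 4096 more fields here\n";  # made-up sample, not real nfsd data

cmpthese(-2, {
  split_fields => sub {
    my @fields = split / /, $line;
  },
  regexp_fields => sub {
    local $_ = $line;
    my ($ts)    = /^(\d+)/gc;
    my ($type)  = /(\w)\sV/gc;
    my ($op)    = /\s(\d\d?)\s\w/gc;
    my ($bytes) = /\w\s(\d+)\s/gc;
  },
});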

The same bench_splitters.pl also gives you the maximum extraction rate you can expect from the full script. I also included timing of the bookkeeping parts of the original script. The only noteworthy detail is that when you hit the second-level condition, you pay the price of the modulus operator big time. I also think that something is wrong with the input: those timestamps don't look like normal second-precision timestamps, they are too big, so I don't know if $ts % 3600 is the proper way to group performance by hour.

Second, I wrote max_line_rate.pl (output on my laptop, download link), which gives you an upper bound on the line rate you can expect while parsing the required fields. You can run this script and stop it at any point with ctrl-c, and it will print a performance report up to that point. Every 128k lines, a single-line performance report is also printed.
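
The ctrl-c part is nothing fancy, just a SIGINT handler; a stripped-down sketch of the idea (not the actual max_line_rate.pl):

#!/usr/bin/env perl
# Stripped-down sketch of the ctrl-c reporting idea: a SIGINT handler
# prints the stats gathered so far, then exits.
use strict;
use warnings;
use Time::HiRes qw( time );

my $lines = 0;
my $start = time();

sub report {
  my $elapsed = time() - $start;
  printf "%d lines in %.2fs (%.0f lines/s)\n",
    $lines, $elapsed, $elapsed ? $lines / $elapsed : 0;
}

$SIG{INT} = sub { report(); exit 0 };

while (<STDIN>) {
  $lines++;
  report() unless $lines & 0x1ffff;   # single-line report every 128k lines
}
report();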

You can use this max_line_rate.pl to compare your system perl with the 5.10.1 you compiled. I had much better performance with 5.8.8 in this particular application.

Finally, I rewrote the statistics script, to deal with the report that my previous version was consuming 7.5GB of RAM. The reason is simple enough: I don't have access to the original input, only to a six-line excerpt that was posted, so the regexps I use to extract the required fields might fail on the real data.

The new script, fast_stats.pl (output on my laptop, download link), is more robust and should deal with lines that cannot be parsed: it will print the line that couldn't be matched and ignore it. Also, I've included the output of the ps command at the start and end of both fast_stats.pl and max_line_rate.pl to show that the RSS doesn't change that much.
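
The defensive part boils down to checking that all the fields actually matched before using them; a sketch of the idea (not the exact code in fast_stats.pl):

while (<STDIN>) {
  my ($ts)    = /^(\d+)/gc;
  my ($type)  = /(\w)\sV/gc;
  my ($op)    = /\s(\d\d?)\s\w/gc;
  my ($bytes) = /\w\s(\d+)\s/gc;

  unless (defined $ts && defined $type && defined $op && defined $bytes) {
    print "could not parse: $_";
    next;
  }

  ## ... update the per-hour stats here ...
}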

To compare the original stats_basic_optimized.pl with my fast_stats.pl I wrote a small shell script bench.sh (output on my laptop). The nfsd.gz input file was generated with the build_source_file.pl script with the command:

build_source_file.pl 1073741824 | gzip --best > nfsd.gz

The new fast_stats.pl is almost twice as fast as the old one on my laptop.

On a final note (I've wasted too much time on this already...), I'm not out to compare Perl with Python or Ruby or even PHP. I would like to know how we measure up against C, though. The reason is simple: when Perl programmers feel that something is slow, they turn to C, not to another scripting language.

This experiment is mostly to show that writing fast Perl will sometimes take you down unexpected paths (like regexps beating a split), and that you should benchmark carefully if performance is critical to you.