Performance Tuning

Faster FASTA, please

A user on Stack Overflow asked "What is the best way for dealing with very big files?". The end of the question asks "Did I reach the maximum performances for the current version of Perl6 ?". In this post I'll go through multiple implementations of splitting up a FASTA file.

The other day on Stack Overflow: User Beuss asked "What is the best way for dealing with very big files?". The question features the task of parsing a FASTA file, which is a common file format in Bioinformatics. The format is extremely simple. It's based on lines, where a line starting with a > gives an identifier which is then followed by any number of lines containing protein sequences or nucleic acid sequences. The simplest you'll see will contain the letters A, C, G, and T, though there are far more letters that have meaning, and extra symbols. The "parsing" part of the question was limited to efficiently splitting the file up in its individual sequences and storing each sequence in a hash keyed on its identifier.

The end of the question asks, specifically, "Did I reach the maximum performances for the current version of Perl6 ?". This immediately caught my eye, of course.

In this post I'd like to take you through my process of figuring out the performance characteristics and potential gains for a program like this.

Let's start with the original code suggested by Beuss and make a few tiny modifications: The original Stack Overflow code was lacking the class seq, but as far as I could tell it needed nothing but two attributes. There was a line that would output every id it came across, but I/O like that is really slow, so I just removed it. I also added very simple timing code around the three stages: Slurping, splitting, and parsing. Here's the result:

my class seq {
  has $.id;
  has $.seq;
}
my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    my $id;
    my $s;
    my $now = now;

    say "Slurping ...";
    my $f = $!file.IO.slurp;
    say "slurped in { now - $now }";
    $now = now;

    say "Splitting file ...";
    my @lines = $f.split(/\n/);
    say "split in { now - $now }";
    $now = now;

    say "Parsing lines ...";
    for @lines -> $line {
      if $line !~~ /^\>/ {
          $s ~= $line;
      }
      else {
        if $id.defined {
          %!seq{$id} = seq.new(id => $id, seq => $s);
        }
        $id = $line;
        $id ~~ s:g/^\>//;
        $s = "";
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);
    say "parsed in { now - $now }";
  }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

And let's generate an example genome.fa file to test it out with. This one-liner will give you a genome.fa file that's 150_000_200 characters long, has 2_025_704 lines in total, 192_893 of which are lines with identifiers, and the remaining 1_832_811 lines are sequence lines with 80 characters each.

perl6 -e 'srand(2); my $f = "genome.fa".IO.open(:w); while $f.tell < 150_000_000 { $f.put(">" ~ flat("A".."Z", "a".."z", "0".."9", "_", "-").roll((5..7).pick).join); $f.put(<A C G T>.roll(80).join()) for ^(3..16).pick }'

This script has not been optimized for performance at all ;)

Okay, we're just about ready to go with this. Let's have a look at how long only the first two stages take by just hitting Ctrl-C after it outputs "Parsing lines ...":

Slurping ...
slurped in 1.4252846
Splitting file ...
split in 30.75685953
Parsing lines ...

Huh. That's pretty darn slow, isn't it? 67k lines split per second? We should really be able to do better than that. Let's zoom in on the slurping and splitting:

say "Slurping ...";
my $f = $!file.IO.slurp;

say "Splitting file ...";
my @lines = $f.split(/\n/);

My experience with Rakudo has taught me many times that currently our regexes are much more expensive than they have to be. Even though this regex is extremely simple, the regex engine is currently an all-or-nothing deal.

Let's use the built-in method lines on Str instead and see how that fares:

Slurping ...
slurped in 1.4593975
Splitting file ...
split in 2.9614959
Parsing lines ...
parsed in 32.9007177

Cool, that's already about 10x the performance for just the splitting! If I had let the program run to completion before, the whole program's run time would have been 50% slurping and splitting and 50% parsing. But if you look at the parsing part, there are two more regexes in there, too:

for @lines -> $line {
  if $line !~~ /^\>/ {
      # ...
  }
  else {
    # ...
    $id ~~ s:g/^\>//;
    $s = "";
  }
}

Can we do the same thing without regex here, too? Sure! $line !~~ /^\>/ is equivalent to the much, much faster not $line.starts-with(">"), and since we already know in that branch of the if statement that the line starts with > we can replace $id ~~ s:g/^\>// with just $id .= substr(1). Let's see what happens to the performance now:

Slurping ...
slurped in 1.463816
Splitting file ...
split in 2.9924887
Parsing lines ...
parsed in 3.8784822

Cool. The parsing stage is now about 8.5x as fast. In total, it used to take about 1m10s, now it takes 8.6s.
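To convince ourselves that the regex-free spellings really do the same job, here's a minimal sketch on a tiny in-memory sample. The sample string and the plain hash are made up for illustration (the real program reads genome.fa and stores seq objects); only lines, starts-with, and substr are the built-ins discussed above.

```raku
# Tiny made-up FASTA sample, standing in for the real genome.fa.
my $f = ">id1\nACGT\nTTGA\n>id2\nCCCC\n";

# .lines splits on newlines without going through the regex engine.
my @lines = $f.lines;

my %seq;
my $id;
my $s = "";
for @lines -> $line {
    if not $line.starts-with(">") {     # was: $line !~~ /^\>/
        $s ~= $line;
    }
    else {
        %seq{$id} = $s if $id.defined;
        $id = $line.substr(1);          # was: $id = $line; $id ~~ s:g/^\>//
        $s = "";
    }
}
%seq{$id} = $s;

say %seq<id1>;   # ACGTTTGA
say %seq<id2>;   # CCCC
```

Note that lines also drops the empty string that split would leave behind after the final newline, so there's one less entry to walk over.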

Second Implementation

Let's switch gears for a bit. Before I came into the discussion, Stack Overflow user Christoph already came up with a good answer. They also immediately had the instinct to cut out the regex/grammar engine to get a speed-up. The first suggested piece of code looks like this:

my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
    .[0] => .[1..*].join given .split("\n");
}

It works like this: It splits the whole file by the > character. Now every chunk after the split is a string consisting of the ID line and all FASTA sequence lines that come after it and before the next ID line - except of course if there's a > in the middle of some line. ^[1]

Since the file itself starts with a > character ^[2] we have to skip the very first entry, as it would just be an empty string. The code does that with array slice syntax [1..*]. Then in the block it splits the individual strings that each start with the ID line followed by the sequence data into lines.
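To make the two steps a bit more tangible, here's a small sketch that runs the same expression on a made-up two-record string instead of the real file. The sample data and the @chunks variable are my own additions for illustration.

```raku
# Made-up two-record sample standing in for genome.fa.
my $fasta = ">id1\nACGT\nTTGA\n>id2\nCCCC\n";

# Step 1: split on '>'. The first chunk is the empty string
# in front of the very first '>'.
my @chunks = $fasta.split('>');
say @chunks.elems;       # 3
say @chunks[0].chars;    # 0

# Step 2: the same expression as above, fed from the sample string.
my %seqs = $fasta.split('>')[1..*].map: {
    .[0] => .[1..*].join given .split("\n");
}
say %seqs<id1>;          # ACGTTTGA
say %seqs<id2>;          # CCCC
```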

I like this answer a lot. It's short and sweet, but it's not golfed to the point of being unreadable. Let's see how it performs!

time perl6 christoph-fasta.p6 
38.29user 0.53system 0:38.85elapsed 99%CPU (0avgtext+0avgdata 1040836maxresident)k

Whoops, that's very slow compared to our optimized code from above! Since this program is mostly methods from built-in classes, we'll most likely have to find more efficient versions of what we've got in the code right now.

The slurp and split invocations in the beginning are probably as fast as we can get, but what about using [1..*] to skip the first element?

split returns a Seq object, which is the Perl 6 user's way to work with iterators. One important feature of Seq is that it throws away values after they have been consumed. However, if you use array accesses like [1] or [4, 5, 2, 1], it can't do that. The code doesn't know whether you're going to ask for lower indices later in the list, so without keeping the values around, that last example would lead to an error. So it caches the values, literally by calling the cache method on the Seq.
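Here's a minimal sketch of that behaviour, as far as I understand it. The strings and the names once, kept, and @seen are made up for the demonstration; the sigilless names are only there so that the Seq isn't wrapped in a Scalar container.

```raku
# Bind to a sigilless name so we hold the Seq itself.
my \once = "x>a>b".split(">");
say once.^name;                     # Seq

my @seen;
@seen.push($_) for once;            # first pass consumes the values
try { @seen.push($_) for once }     # second pass has nothing left to read
say $!.^name;                       # X::Seq::Consumed

# Indexing calls .cache behind the scenes, so every value stays around
# and lower indices remain reachable afterwards.
my \kept = "x>a>b".split(">");
say kept[2];                        # b
say kept[0];                        # x
```

The caching is what makes the second block work, and it's also exactly the memory and time cost we'd like to avoid when all we want is to drop one element.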

Surely there's a way to skip a single element without having to memoize the resulting list? Turns out that there is: The skip method is one of the few methods on the Seq class itself! Let's go ahead and replace the first [1..*] with a call to skip. Another thing we can do is replace .[0] with .head and the other .[1..*] with a .skip(1) as well. For these to work we'll have to add our own .cache call on the .split, though. Here's the code we end up with:

my %seqs = slurp('genome.fa', :enc<latin1>).split('>').skip(1).map: {
    .head => .skip(1).join given .split("\n").cache;
}

And here's the run time:

time perl6 christoph-fasta-no-circumfix.p6 
12.18user 0.57system 0:12.79elapsed 99%CPU (0avgtext+0avgdata 1034176maxresident)k

That's already better, but nowhere near where I'd like it to be. However, I couldn't yet come up with a way to make this variant any faster.

Third Implementation

Stack Overflow user Christoph also had a second implementation in their answer. It's based on finding the next interesting character with the .index method on strings. Let's see how it compares! Here's the code in full:

my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
    $pos = $data.index('>', $pos) // last;

    my $ks = $pos + 1;
    my $ke = $data.index("\n", $ks);

    my $ss = $ke + 1;
    my $se = $data.index('>', $ss) // $data.chars;

    my @lines;

    $pos = $ss;
    while $pos < $se {
        my $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos..^$end));
        $pos = $end + 1
    }

    %seqs{$data.substr($ks..^$ke)} = @lines.join;
}

And a first timing run:

time perl6 christoph-two.p6 
15.65user 0.44system 0:16.05elapsed 100%CPU (0avgtext+0avgdata 1011608maxresident)k

Now that doesn't look too bad. It's already faster than the previous implementation's first version, but a bit slower than the final version I presented just above.

Let's grab a profile with the profiler and see what we can find!

Opening up the routines tab and sorting by exclusive time, we immediately see something rather suspicious:

[Screenshot: the profiler's routines tab, sorted by exclusive time]

Here we can see that the Range construction operator ..^ is responsible for a big chunk of time – almost a third – and the substr method is responsible for almost a quarter. The new method just below that isn't actually interesting, as it's just always called by the ..^ operator, and as such has its time captured in the operator's inclusive time.

So what are we using ..^ for? The code uses the form of substr that passes a Range instead of a start and length argument. If you have a start and an end position like in this case, it's a whole lot nicer to look at. Unfortunately, it seems to suffer from a large amount of overhead.

Let's rewrite the code to use the .substr($start, $amount) form instead. The transformation is very simple:

@lines.push($data.substr($pos..^$end));
# becomes
@lines.push($data.substr($pos, $end - $pos));
# and
%seqs{$data.substr($ks..^$ke)} = @lines.join;
# becomes
%seqs{$data.substr($ks, $ke - $ks)} = @lines.join;

And now we can time the result:

time perl6 christoph-two-no-range.p6 
8.34user 0.44system 0:08.72elapsed 100%CPU (0avgtext+0avgdata 1010172maxresident)k

Great result! We've shaved off around 45% of the run time with just our first find!
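As a quick sanity check that the two forms of substr cut out the same piece, here's a small sketch on a made-up string. The variable names $by-range and $by-length are mine; $ks and $ke mean the same as in the code above.

```raku
my $data = ">id1\nACGT\n";
my $ks = 1;                          # first character after the '>'
my $ke = $data.index("\n", $ks);     # newline that ends the ID line

# Range form: start and (exclusive) end position.
my $by-range  = $data.substr($ks ..^ $ke);
# Start/length form: same slice, no Range object to construct.
my $by-length = $data.substr($ks, $ke - $ks);

say $by-range;                  # id1
say $by-range eq $by-length;    # True
```

The Range form is nicer to read, but going by the profile it pays for a Range construction on every single call, which adds up over roughly two million lines.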

What else can we do? Let's see if sprinkling some native types would help gain performance for this script. Let's make all the integer variables be typed int, which is a native 64-bit integer instead of the arbitrarily large Int. We can also turn the @lines array into a native string array, saving us one layer of indirection for every entry we have. Here's the full code:

my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;
loop {
    $pos = $data.index('>', $pos) // last;

    my int $ks = $pos + 1;
    my int $ke = $data.index("\n", $ks);

    my int $ss = $ke + 1;
    my int $se = $data.index('>', $ss) // $data.chars;

    my str @lines;

    $pos = $ss;
    while $pos < $se {
        my int $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos, $end - $pos));
        $pos = $end + 1
    }

    %seqs{$data.substr($ks, $ke - $ks)} = @lines.join;
}

And here's the timing:

time perl6 christoph-two-no-range-native-int.p6 
6.29user 0.36system 0:06.60elapsed 100%CPU (0avgtext+0avgdata 1017040maxresident)k

Oh, huh. That's not quite an amazing improvement, but let's see if we can push it further by turning $data into a native string! Surely that'll give us a little speed-up?

time perl6 christoph-two-no-range-native-int-str.p6 
7.16user 0.36system 0:07.45elapsed 100%CPU (0avgtext+0avgdata 1017076maxresident)k

Isn't that interesting? Turns out that in order to call a method on a native string, rakudo has to create a temporary Str object to "box" the value into something that can have methods and such. That means that every method call on $data will create one shiny Str object for us. That's not quite what we want ☺

Conveniently, there are also sub forms of index and substr. We can either rewrite the method calls to sub calls and move the invocant (this is how we refer to the thing the method is called on) to be the first argument, or we can use the convenient "use sub but with method syntax" feature Perl 6 has. It looks like $data.&substr($ks, $ke - $ks) and all it does is put the invocant as the first argument of the sub and the rest of the arguments follow.

Unfortunately, there aren't actually candidates for these subs that will take native strings and ints, and so we'll end up with the same problem!
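For reference, here's what the three spellings look like side by side on a made-up string. This sketch only shows the syntax; it doesn't measure or avoid any boxing.

```raku
my $data = ">id1\nACGT\n";

# Plain method call.
say $data.substr(1, 3);        # id1
# Sub form: the invocant moves to the first argument.
say substr($data, 1, 3);       # id1
# "Sub with method syntax": .& passes the invocant as the first argument.
say $data.&substr(1, 3);       # id1
say $data.&index("\n", 1);     # 4
```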

Eliminating boxed objects, Str, Int, Num, and similar things, is actually on the agenda for MoarVM. Recent improvements to the dynamic specializer "spesh" by jnthn have been laying the foundation on top of which improving this situation should be doable.

Illegal Performance Gains

So is this the most we can get? Not quite. That was actually a pun, because there's a thing called NQP, which stands for "Not Quite Perl". It's both a separate language with much stricter rules that rakudo itself is written in, and the namespace under which most of the low-level operations that the VM knows are available. These ops are not part of the Perl 6 Language Specification, and the rakudo developers do not guarantee that any code you write using NQP ops will continue working on newer versions of rakudo.

What it does allow us to do is find out where the performance ceiling is, roughly. I'll first write the code to use NQP ops, and then I'll explain what I mean by that.

use nqp;

my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;

my str @lines;

loop {
    $pos = nqp::index($data, '>', $pos);

    last if $pos < 0;

    my int $ks = $pos + 1;
    my int $ke = nqp::index($data, "\n", $ks);

    my int $ss = $ke + 1;
    my int $se = nqp::index($data ,'>', $ss);

    if $se < 0 {
        $se = nqp::chars($data);
    }

    $pos = $ss;
    my int $end;

    while $pos < $se {
        $end = nqp::index($data, "\n", $pos);
        nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
        $pos = $end + 1
    }

    nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
    nqp::setelems(@lines, 0);
}

Let's go through it piece by piece. The first line is new, it's use nqp;. It's a synonym for use MONKEY-GUTS; which is a bold declaration meaning in essence "I know what I'm doing, and I deserve whatever I've got coming to me".

We'll use a low-level hash object taken from the nqp world by binding to a scalar variable. We use the type constraint Mu here, because nqp types aren't part of the Perl 6 type hierarchy, and thus will not go through a type check for Any, which is the default type constraint for scalar variables. Also, it does not do the Associative role, which is why we can't bind it to a variable with a % sigil.

Next, we'll pull out the @lines array, so that we don't have to allocate a new one for every round through the loop. We don't have to use nqp::list_s() here like with the hash, because the native string array you get from my str @foo has barely any overhead if we use nqp ops on it rather than methods.

I've removed usage of the // operator, though I am not actually sure how much overhead it has.

The signature of nqp::index is the same as the three-argument sub version of index, and the same is true for the nqp::substr op. There's also an nqp::join op that will only accept a native (or literal) string as first argument and a native string array as the second argument.

You'll also notice that the $end variable is now outside of the inner loop. That has a relatively simple reason: A block that introduces lexical variables cannot be inlined into the outer block. That means that the inner block has to be invoked as a closure, so that it has access to all of the relevant variables. This adds the combined overhead of invoking a code object and of taking a closure. The Garbage Collector has to sweep all of those closures up for us. It'll be best not to generate them in the first place.

We use the nqp op nqp::push_s to add to the @lines array because the regular nqp::push op works with "objects", rather than native strings.

Then there's something that has no corresponding piece of code in the previous version: nqp::setelems(@lines, 0). Since we keep the @lines array around instead of building a new one every time, we have to empty it out. That's what nqp::setelems does, and it's very cheap.

A profile of this code tells us that all that's left being allocated is Str objects, exactly 192_899 of them. This comes from the fact that the hash wants to store objects, not native strings.

Let's see what the run time is!

time perl6 christoph-two-no-range-native-nqp.p6 
2.04user 0.33system 0:02.27elapsed 104%CPU (0avgtext+0avgdata 1004752maxresident)k

Whew! Our fastest implementation so far took 6.6s, now we're down to 2.3s, which is close to a third of the time the second-fastest version takes.

What's a "performance ceiling"?

Everything our code does will at some point end up using nqp:: ops to actually do work. Have a look at the substr method of Str, the index method of Str, the push method of strarray, and the ASSIGN-KEY method of Hash, which sits behind postcircumfix:<{ }>. In between our code and these ops there are often multiple layers of methods that make things more comfortable to work with, or check values for validity.

Rakudo's static optimizer already works towards simpler code consisting of fewer calls. For example it would replace a call to infix:<+> with a low-level nqp::add_i op if it knows only native ints are involved. MoarVM's dynamic specializer has a lot more knowledge to work with, as it watches the code during execution, and can speculatively inline sub and method calls. After inlining, more optimizations are available, such as removing boxing that was necessary for passing arguments into, and returning the result out of the routine – this is currently being worked on!

If the MoarVM specializer were flawless, it ought to be able to generate the equivalent of what I coded up by hand. It will not do something as bold as keeping that one array around between rounds and just clearing it at the right moment, as it is currently not able to prove that the array doesn't get stashed away somewhere. But all in all, most of what I did should be achievable with more intelligence in the specializer.

The term "performance ceiling" is still not quite accurate, though. There's still a lot of optimization potential in the JIT compiler, for example. Bart Wiegmans just blogged about a benchmark where recent improvements to MoarVM got us to the point where the code was only 30% slower than an equivalent implementation in C. That was mostly due to the focus of the code being floating point operations, which likely take so long individually that imperfect code-gen is less of a problem.

But this is about the most we can get from the current version of rakudo, unless we find a better algorithm to do what we want.

The Elephant in the RAM

One thing that you've surely noticed is that this program uses about one gigabyte of memory at its highest point (that's what "maxresident" means). Sadly, this is a property of MoarVM's string implementation. In order for grapheme-level access to be fast ("constant time"), we upgrade strings to 32 bits per grapheme if needed, rather than storing strings as utf8 internally. Our strings also support 8 bits per character storage, which I had expected to be used here, but something in the machinery upgrades the string data from 8 bits to 32 bits, even though all character values ought to fit.

In the medium to far future, we'll also get strings that sacrifice constant time access for storage efficiency, but we're not at that point just yet.

Is there something else we could do to get around this? Sure! Instead of saving the string we cut out of the source file with substr and concatenated with join, we could save the start and end value of every piece of string in a native int array. We could implement a Hash subclass or compose in a role that grabs the data from the source file whenever the user asks for it. Native int arrays are much faster to GC than string arrays, and if you instead hold an index into a single giant int array in the hash, you can reduce the pressure on the GC even further!
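Here's a rough sketch of the offsets idea, without the Hash subclass or role. It assumes a well-formed file where every ID line ends in a newline; the sample string, the @bounds and %slot names, and the sequence sub are all made up for illustration, and I haven't measured whether this helps on the real file.

```raku
# Made-up two-record sample standing in for the slurped genome.fa.
my $data = ">id1\nACGT\nTTGA\n>id2\nCCCC\n";

my int @bounds;   # flat list: start, end, start, end, ...
my %slot;         # id => index of that record's start inside @bounds

my int $pos = 0;
loop {
    $pos = $data.index('>', $pos) // last;
    my int $ke = $data.index("\n", $pos + 1);                # end of the ID line
    my int $se = $data.index('>', $ke + 1) // $data.chars;   # end of the record

    %slot{$data.substr($pos + 1, $ke - $pos - 1)} = @bounds.elems;
    @bounds.push($ke + 1);
    @bounds.push($se);
    $pos = $se;
}

# Cut the sequence out of $data only when somebody asks for it.
sub sequence(Str $id) {
    my $i = %slot{$id} // return Nil;
    $data.substr(@bounds[$i], @bounds[$i + 1] - @bounds[$i]).subst("\n", "", :g);
}

say sequence("id1");   # ACGTTTGA
say sequence("id2");   # CCCC
```

The trade-off is that every lookup now pays for a substr and a newline removal, so this only makes sense if you ask for a small share of the sequences.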

That's a task for another post, though, as this one rapidly approaches 4k words.

Yet another task would be to write the same program in Perl 5, Python, and/or Ruby. It should be interesting to compare performance characteristics among those. Surely our fastest code is still slower than at least one of these, but having a target in sight could help figure out which parts exactly are slower than they should be.

I personally don't code any Ruby or Perl 5, so I'd be happy if someone would contribute those implementations!

Parting QWORDS

Thanks for sticking with me for the duration of this huge post, and thanks to raiph for requesting this article. It may have been a bit more work than I had anticipated, but it was fun, and hopefully it is interesting for you readers!

And here's the QWORDS I promised: 0x7fffffffd950, 0x7ffff3a17c88, and 0x7ffff62dd700.

Have a good time, and see you in the next one!
- Timo

[1] This ought to be a relatively simple fix. Split by \n> instead of >, and handle the very first blob differently, because it now starts with a > still left in it.
[2] We're ignoring the possibility of adding comments to the beginning of the file, or anywhere in the file, really.