March 22, 2019 · Heap Snapshot Profiler grant

Intermediate Progress Report: Heap Snapshots

Hello dear readers,

This time I don't have anything finished to show off, but I can nevertheless tell you all about the work I've recently started.

The very first post on this blog already talked a little about the heap snapshot profiler. It introduced a binary format so that snapshots can be written out immediately after they are taken, and by MoarVM itself rather than by NQP code. This also made it possible to load snapshot data faster: the files contain an index that allows a big chunk of data to be split exactly in the middle and then read and decoded by two threads in parallel.

The new format also resulted in much smaller output files, and of course reduced the memory usage of running perl6 code with the profiler turned on. However, since it still captures one heap snapshot every time the GC runs, any ordinary program that runs for longer than a few minutes will accumulate quite an amount of data. Heap snapshot files (which are actually collections of multiple snapshots) can very easily outgrow a gigabyte. There would have to be another change.


Enter Compression

The new format already contained the simplest thing that you could call compression: instead of writing every record to the file at full width, records holding smaller numbers are stored with a shorter representation. This saved a lot of space already, but not nearly as much as off-the-shelf compression techniques would.
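To give an idea of what "a shorter representation for smaller numbers" can look like, here's a little LEB128-style varint encoder in C. MoarVM's actual encoding may differ in its details; the function name and the exact scheme are just for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* Encode a number so that small values take fewer bytes: 7 bits of
       payload per byte, with the high bit set when more bytes follow. */
    static size_t encode_varint(uint64_t value, uint8_t *out) {
        size_t n = 0;
        do {
            uint8_t byte = value & 0x7f;   /* low 7 bits of the value */
            value >>= 7;
            if (value)
                byte |= 0x80;              /* continuation bit */
            out[n++] = byte;
        } while (value);
        return n;   /* 1 byte for values below 128, up to 10 for a full 64-bit value */
    }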

There had to be a better way to get compression than coming up with my own scheme! Well, obviously I could have just implemented something that already exists. However, at the time I was discouraged by the specific requirements of the heap snapshot analyzer - the tool that reads the files and lets the user interactively explore the data within:

  • Reading the file has to be fast
  • Any snapshot in the file must be accessible equally quickly and in any order
  • In order to get decent speed out of the reading process, it has to be multithreadable

Normally, compression formats are not built to support easily seeking to any given spot in the uncompressed data. There was of course the possibility to compress each snapshot individually, but that would mean a whole snapshot could either only be read with a single thread, or the decompression would have to run through the whole blob sequentially, handing each splittable piece off to a parsing thread as soon as it was decompressed. I'm not entirely sure why I didn't go with that; perhaps I just didn't think of it back then. After all, it's already been more than a year, and my brain compresses data by forgetting stuff.

Anyway, I recently decided to try a regular compression format for the new MoarVM heap snapshot files. There's already a Perl 6 module named Compress::Zlib, which I initially wanted to use. Writing the data out from MoarVM was trivial once I linked it against zlib. Just replace fopen with gzopen, fwrite with gzwrite, fclose with gzclose and you're almost done! The compression ratio wasn't too shabby either.
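For the writing side, the swap really is about as mechanical as that sounds. Here's a minimal sketch of the pattern, not MoarVM's actual writer code; the function and its arguments are made up for illustration:

    #include <zlib.h>

    /* Write a buffer through zlib instead of plain stdio; "wb9" asks for
       maximum gzip compression. */
    static int write_blob_gzipped(const char *path, const void *buf, unsigned len) {
        gzFile out = gzopen(path, "wb9");       /* was: fopen(path, "wb") */
        if (!out)
            return -1;
        int written = gzwrite(out, buf, len);   /* was: fwrite(buf, 1, len, fh) */
        gzclose(out);                           /* was: fclose(fh) */
        return written == (int)len ? 0 : -1;
    }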

When I mentioned this in the #moarvm channel on IRC, I was asked why I used zlib instead of zstd. After all, zstd usually (or always?) outperforms zlib in both compression/decompression speed and output size. The only answer I had for that was that I hadn't used the zstd C library yet, and there was no Perl 6 module for it yet either.

Figuring out zstd didn't go as smoothly as zlib, not by a long shot. But first I'd like to explain how I solved the issue of reading the file with multiple threads.

Restructuring the data

In the current binary format, there are areas for the different kinds of objects that occur once per snapshot: collectables and references. On top of that there are objects that are shared across snapshots: strings that are referenced from all the other kinds of objects (for example filenames, or descriptions for references like "hashtable entry"), static frames (made up of a name like "push", an ID, a filename, and a line number), and types (made up of a repr name like VMArray, P6opaque, or CStruct, and a type name like BagHash or Regex).

That resulted in a file format that stores one complete object after the other within each area. The heap snapshot analyzer then goes through these areas, splits the individual values apart, and shoves them off into a queue for another thread to actually store. Storage inside the heap analyzer consists of one array for each part of these objects. For example, there is one array of integers for all the descriptions and one array of integers for all the target collectables. The main benefit of that is that the garbage collector doesn't have to go through all those objects when it runs, since the arrays hold plain integers.
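In C terms (the analyzer itself is written in Perl 6), this storage looks roughly like a "struct of arrays" rather than an array of structs. The names below are illustrative, not the analyzer's actual ones:

    #include <stddef.h>

    /* One flat array per attribute instead of one object per reference.
       Flat native-integer arrays contain no pointers for a GC to trace. */
    typedef struct {
        long   *descriptions;   /* index into the string heap, one per reference */
        long   *targets;        /* collectable index each reference points at   */
        size_t  num_refs;
    } ReferenceColumns;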

The new format, on the other hand, puts all the values of one attribute next to each other in one long blob before continuing with the next attribute.

Here's how the data for static frames was laid out in the file in the previous version:

"sframes", count, name 1, uuid 1, file 1, line 1, name 2, uuid 2, file 2, line 2, name 3, uuid 3, file 3, line 3, … index data

The count at the beginning tells us how many entries we should be expecting. For collectables, types, reprs, and static frames it also gives us the exact number of bytes to look for, since every entry has the same size. References, on the other hand, have a simple "compression" applied to them, so the count alone doesn't tell us the total size. To make up for this, the total size lives at the end, in a place the parser can easily find. Strings are variable in length too, but there are usually only a few hundred of them. References take up the most space in total; having four to five times as many references as there are collectables is pretty normal.
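For the fixed-size areas, that means the parser can compute the byte span of a whole area from the count alone. A tiny sketch; the four-fields-of-eight-bytes entry size here is an assumption for illustration, not the exact on-disk encoding:

    #include <stdint.h>
    #include <stddef.h>

    /* Static frames: name, uuid, file, line per entry (sizes assumed here). */
    static size_t sframe_area_bytes(uint64_t count) {
        return (size_t)count * 4 * sizeof(uint64_t);
    }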

Here's how the same static frame data is laid out in the upcoming format:

"names", length, zstd(name 1, name 2, name 3, …), index data, "uuids", length, zstd(uuid 1, uuid 2, uuid 3, …), index data, "files", length, zstd(file 1, file 2, file 3, …), index data, "lines", length, zstd(line 1, line 2, line 3, …), index data

As you can see, the values for each attribute now live in the same space. Each attribute blob is compressed individually, each has a little piece of index data at the end and a length field at the start. The length field is actually supposed to hold the total size of the compressed blob, but if the heap snapshot is being output to a destination that doesn't support seeking (moving back in the file and overwriting an earlier piece of data), we'll just leave it zeroed out and rely on zstd's format being able to tell when one blob ends.
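Here's roughly what writing one such attribute blob could look like with zstd's one-shot API. This is a simplified sketch, not MoarVM's actual writer: error handling is trimmed, and the length field is simply left at zero when the output stream can't seek.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <zstd.h>

    static int write_attribute_blob(FILE *out, const void *values, size_t values_size) {
        long length_pos = ftell(out);               /* where the length field will live */
        uint64_t length = 0;
        fwrite(&length, sizeof(length), 1, out);    /* placeholder; stays 0 if we can't seek */

        size_t bound = ZSTD_compressBound(values_size);
        void  *dst   = malloc(bound);
        size_t csize = ZSTD_compress(dst, bound, values, values_size, 19);  /* 19 = max "regular" level */
        if (ZSTD_isError(csize)) { free(dst); return -1; }
        fwrite(dst, 1, csize, out);
        free(dst);

        /* If the destination is seekable, go back and fill in the real size. */
        if (length_pos >= 0 && fseek(out, length_pos, SEEK_SET) == 0) {
            length = csize;
            fwrite(&length, sizeof(length), 1, out);
            fseek(out, 0, SEEK_END);
        }
        return 0;
    }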

There are some benefits to this approach that I'd like to point out:

  • The compression algorithm can perhaps figure patterns out better, because more "random" data isn't interspersed to break up patterns. I have not actually tested this yet, though!
  • The individual attributes can be decompressed and stored away in parallel, since many attributes share a single count. Compare that to decoding 500 thousand collectables on one thread and 1.7 million references on another; there's a little sketch of this right after the list.
  • Need to add another attribute for some reason? You can just put it next to the other blobs, give it a new name that the older versions don't know and they'll just skip it!
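Here's the parallel-decoding sketch promised above: one worker per attribute blob, each calling zstd's one-shot decompression independently. The struct, field names, and threading setup are made up for illustration (the real analyzer is Perl 6 code); zstd frames record their decompressed size, which is what makes each blob self-contained.

    #include <pthread.h>
    #include <stdlib.h>
    #include <zstd.h>

    typedef struct {
        const void *compressed;        /* points into the mapped snapshot file */
        size_t      compressed_size;
        void       *decoded;           /* filled in by the worker thread */
        size_t      decoded_size;
    } AttributeBlob;

    static void *decode_blob(void *arg) {
        AttributeBlob *blob = arg;
        unsigned long long raw = ZSTD_getFrameContentSize(blob->compressed,
                                                          blob->compressed_size);
        if (raw == ZSTD_CONTENTSIZE_ERROR || raw == ZSTD_CONTENTSIZE_UNKNOWN)
            return NULL;
        blob->decoded      = malloc((size_t)raw);
        blob->decoded_size = ZSTD_decompress(blob->decoded, (size_t)raw,
                                             blob->compressed, blob->compressed_size);
        return NULL;
    }

    /* One thread per attribute blob (names, uuids, files, lines, ...). */
    static void decode_all_blobs(AttributeBlob *blobs, size_t count) {
        pthread_t *threads = malloc(count * sizeof(pthread_t));
        for (size_t i = 0; i < count; i++)
            pthread_create(&threads[i], NULL, decode_blob, &blobs[i]);
        for (size_t i = 0; i < count; i++)
            pthread_join(threads[i], NULL);
        free(threads);
    }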

The last point in that list will, in fact, let me put some extra fun into the files. First of all, I currently start the files with a "plain text comment" that explains what kind of file it is and how it's structured internally. That way, if someone stumbles upon such a file in fifty years, they can get started figuring out its contents right away!

On a perhaps more serious note, I'll put in summaries of each snapshot that MoarVM can already generate while it's writing the snapshot out. Not only things like "total number of collectables", but also "top 25 types by size" and "top 25 types by count". That will let the heap analyzer avoid touching the actual data until the user asks for something more specific, like a "top 100" or "give me a thousand objects of type Hash".

On top of that, why not allow the user to "edit" heap snapshots, put some comments in before sharing them with others, or maybe "bookmark" specific objects?

All of these things will be easier to do with the new format - that's the hope at least!

Did the compression work?

I didn't have the patience to exhaustively measure all the details, but here's a rough comparison: one dataset I've got results in a 1.1 gigabyte file with the current binary format, a 147 megabyte file when using gzip, and a 99 megabyte file using zstd (at the maximum "regular" compression level - I haven't checked yet whether the CPU usage is prohibitive at that level, though).

It seems like this is a viable way forward! Going from 1.1 gigabytes down to roughly a tenth of that means a capture can run about ten times as long before producing a file of the same size, which is a nice thing for sure.

What comes next?

The profiler view itself in Moarperf isn't done yet, of course. I may not put in more than the simplest of features if I start on the web frontend for the heap analyzer itself.

On the other hand, there's another task that's been waiting: packaging moarperf for simpler usage. We recently got support for a relocatable perl6 binary merged into master. That should make it possible to create an AppImage of moarperf. A Docker container should also be relatively easy to build.

We'll see what my actual next steps will be - or will have been, I suppose - when I post the next update!

Thanks for reading and take care
  - Timo
