Delays and Delights
Hi, my name is timotimo and I'm a Perl 6 developer. I've set up this blog to write reports on my TPF Grant.
Before the actual report starts, I'd like to issue an apology. In between my grant application and the grant being accepted I developed a bit of RSI that lasted for multiple months. I already anticipated that near the end of January a move would occupy me a bit, but I had no idea how stressful it would turn out to be, and how long afterwards it would keep me occupied.
I regret that it took me so long to actually get started. However, I've already got little bits to show for the first report of this grant!
Delight №1: Non-crashy multi-threaded profiles
Until now if you've added a signal handler for
Ctrl-C to your program, or used
shell, or even
Proc::Async or any async IO, rakudo or moarvm will have started an additional thread for you. Even if you don't necessarily realize it, this caused the profiler to either violently explode, or return useless or partial results.
Just these past few days I've made a bit of reliability work for the profiler available to everyone running rakudo from the latest source code. Now the profiler will only sometimes abort with the error message "profiler lost sequence" – a bug that I'm also going to chase down as part of the grant. On top of that, the HTML profiler frontend now shows timing and allocations from every thread.
Delight №2: Faster Heap Snapshots
N.B.: I have done the work discussed in this section before the grant actually started, but I think it's still interesting for you.
As you can tell from the – sadly mildly messed-up – grant application, I also work on a second profiler: The Heap Snapshot Profiler. It takes a snapshot of the structure of all objects in memory, and their connections. With this information, you can figure out what kinds of objects account for how much of your memory usage, and for what reason any given object is not being removed by the garbage collector.
If you've already tried out this profiler in the past, you may have noticed that it incurs a significant memory usage increase during run-time. After that, it takes a surprising amount of time to load in the heap snapshot analyzer. This changed noticeably when I implemented a new format for heap snapshots.
Until then, the format had one line (i.e.
\n delimited) per snapshot (i.e. for each GC run) and a json-encoded list of strings up front. The snapshots themselves were then a few lines with lists of integers.
This caused two little issues in terms of performance:
- Encoding the list of strings as JSON had to be done in NQP code, because of the somewhat complex escaping rules.
- Reading the heap snapshot requires a whole lot of string splitting and integer parsing, which is comparatively slow.
The binary-based format I replaced it with addresses both of these issues, and has a few extra features for speed purposes:
- The list of strings is written as a 64 bit integer for the string length followed by the string itself encoded in utf-8. At the very front is a single 64 bit integer that holds the list's size.
- Snapshot data is encoded with either 64 bit integers or a very simple variable-sized integer scheme. Each list stores its size at the beginning, too.
- The very end of the file contains an index with the sizes of all the sections, so that skipping to any point of interest is cheap.
Armed with the index at the end of the file, the binary format can be read by multiple threads at the same time, and that's exactly what the new Heap Snapshot Analyzer will do. In addition to the start positions of each section, there's also a pointer right into the middle of the one section that holds variable-sized entries. That way both halves can be read at the same time and cheaply combined to form the full result.
The "summary all" command loads every snapshot, measures the total size of recorded objects, how many objects there are, and how many connections exist. The result is displayed as a table. Running this against a 1.1 gigabyte big example snapshot with 44 snapshots takes about a minute, uses up 3⅓ CPU cores out of the 4 my machine has, and maxes out at 3 gigs of ram used. That's about 19 megabytes per second. It's not blazing fast, but it's decent.
Delight №3: Web Frontend
Sadly, there isn't anything to show for this yet, but I've started a Cro application that lets the user load up a heap snapshot file and load individual snapshots in it.
It doesn't seem like much at all, but the foundation on which the other bits will rest is laid now.
I intend for the next posts to have a bunch of tiny animated gif screencasts for you to enjoy!
Delight №4: Faster Profile File Writing
The SQL output will likely be the primary data storage format after my grant is done. Therefore I've taken first steps to optimize outputting the data. Right now it already writes out data about 30% faster during the actual writing stage. The patch needs a bit of polish, but then everyone can enjoy it!
The next thing I'd like to work towards is trying out the profiler with a diverse set of workloads: From hyper/race to a Cro application. That will give me a good idea of any pain points you'd encounter. For example, once the ThreadPoolScheduler's supervisor thread runs, it'll spend a majority of its time sleeping in between ticks. This shows up very prominently in the Routines list, to name one example.
If you have questions, suggestions, or comments, please feel free to ping me on IRC, or post comments to this blog using the disqus I've embedded below.
Thanks for reading; see you in the next one!