Good, that's the click-baity title out of the way. Sorry for taking such a long time to write again! There really has been everything going on.

To get back into blogging, I've decided to quickly write about a change I made some time ago already.

This change was for the "instrumented profiler", i.e. the one that will at run-time change all the code of the user's program, in order to measure execution times and count up calls and allocations.

In order to get everything right, the instrumented profiler keeps an entire call graph in memory. If you haven't seen something like it yet, imagine taking stack traces at every point in your program's life, and all these stack traces put together make all the paths in the tree that point at the root.

This means, among other things, that the same function can come up multiple times. With recursion, the same function can in fact come up a few hundred times "in a row". In general, if your call tree can become both deep and wide, you can end up with a whole truckload of nodes in your tree.

Autumn tree branches
Photo by Gabriel Garcia Marengo / Unsplash

Is it a bad thing to have many nodes? Of course, it uses up memory. Only a single path on the tree is ever interesting at any one moment, though. Memory that's not read from or written to is not quite as "expensive". It never has to go into the CPU cache, and is even free to be swapped out to disk and/or compressed. But hold on, is this really actually the case?

It turns out that when you're compiling the Core Setting, which is a code file almost 2½ megabytes big with about 71½ thousand lines, and you're profiling during the parsing process, the tree gets enormous. At the same time, the parsing process slows to a crawl. What on earth is wrong here?

Well, looking at what MoarVM spends most of its time doing while the profiler runs gives you a good hint: It's spending almost all of its time going through the entirety of the tree for garbage collection purposes. Why would it do that, you ask? Well, in order to count allocated objects at every node, you have to match the count with the type you're allocating, and that means you need to hold on to a pointer to the type, and that in turn has to be kept up to date if anything moves (which the GC does to recently-born things) and to make sure types aren't considered unused and thrown out.

That's bad, right? Isn't there anything we can do? Well, we have to know at every node which counter belongs to which type, and we need to give all the types we have to the garbage collector to manage. But nothing forces us to have the types right next to the counter. And that's already the solution to the problem:

Holding on to all types is now the job of a little array kept just once per tree, and next to every counter there's just a little number that tells you where in the array to look.

This increases the cost of recording an allocation, as you'll now have to go to a separate memory location to match types to counters. On the other hand, the "little number" can be much smaller than before, and that saves memory in every node of the tree.

More importantly, the time cost of going through the profiler data is now independent of how big the tree is, since the individual nodes don't have to be looked at at all.

With a task as big as parsing the core setting, which is where almost every type, exception, operator, or sub lives, the difference is a factor of at least a thousand. Well, to be honest I didn't actually calculate the difference, but I'm sure it's somewhere between 100x faster and 10000x faster, and going from "ten microseconds per tree node" to "ten microseconds per tree" isn't a matter of a single factor increase, it's a complexity improvement from O(n) to O(1). As long as you can find a bigger tree, you can come up with a higher improvement factor. Very useful for writing that blog post you've always wanted to put at the center of a heated discussion about dishonest article titles!

Anyway, on testing my patch, esteemed colleague MasterDuke had this to say on IRC:

timotimo: hot damn, what did you do?!?! stage parse only took almost twice as long (i.e., 60s instead of the normal 37s) instead of the 930s last time i did the profile

(psst, don't check what 930 divided by 60 is, or else you'll expose my blog post title for the fraud that it is!)

Well, that's already all I had for this post. Thanks for your attention, stay safe, wear a mask (if by the time you're reading this the covid19 pandemic is still A Thing, or maybe something new has come up), and stay safe!

Pink face mask held in display
Photo by Macau Photo Agency / Unsplash