Tangentially related work
Hi there, it's already been three weeks since my first blog post on my TPF grant work. In between then and now the nice folks over at Edument made public the work I've been doing on the side for a couple of months. Fortunately, my grant work benefits a whole lot from this as well. Being able to debug (set breakpoints, inspect variable contents, single-step execution) the profile dumping code in nqp (re-used in Rakudo), as well as the heap snapshot loading code in
App::MoarVM::HeapAnalyzer::Model lets me more easily figure out why things might be wrong or even crashing.
Since Edument's product is Cro, the reactive framework for implementing and consuming microservices, I use simple Cro applications as test subjects for the profilers, as well as the debugger.
Yesterday I took the first close look at the Cro app that powers the new profiler frontends, by running it under the heap profiler. I was rather surprised to find that even while no requests were being made, the heap snapshot file kept growing at a hundred megabytes every few seconds. Something was clearly amiss.
To understand why this happens you must know that MoarVM will take a heap snapshot after every GC run. That means something must be causing the GC to run frequently even if no actual work is being done.
Fortunately, I know that Rakudo's ThreadPoolScheduler has a built-in supervisor that has an eye on the performance of the thread pool. It runs on its own thread and wakes up a hundred times every second.
My recent profiler work to make multi-threaded applications run properly under the regular profiler let me have a closer look at what was being allocated. Turns out a lot of objects related to iterating over a Range object were being created. A single function was using range iteration, but that accounted for a huge chunk of allocations. Looking at what functions allocate the different types of objects, you can see that pretty much 100% of all Scalar allocations were caused by iterating over a Range.
Scalars may only hold 3 pointers in them on top of the common object header that's 16 bytes big, but it surely adds up! (measurement taken over a 60 second period)
So just changing the
for ^worker-list.elems into an equivalent
loop loop got allocations down significantly.
There was, of course, more left to do. The math we were doing to figure out how active the process has been since we last woke up (including a moving average) caused some boxing and unboxing. A call to a helper method was allocating empty hashes for named arguments every time (something we can often optimize away completely, but in this case we couldn't be sure that it'd be safe). And finally, the
getrusage op was creating a fresh int array every time.
I was initially reluctant to make the code less readable in the interest of performance. However, the supervisor allocating absolutely nothing was almost achieved. Lizmat inlined the getrusage-total helper sub that caused boxing and unboxing on every call, and I decided to inline the prod-affinity-workers helper method as well – this one was only used in a single place, anyway.
The last piece of the puzzle was the
getrusage op that allocated integer arrays every time it was called. To get the last drop of performance out of the supervisor thread, I changed the implementation of
nqp::getrusage to take an already allocated array.
After all this work, the ThreadPoolScheduler will not allocate anything at all if nothing is happening.
I hope I have given you a little insight into what working on performance can look like.
Now that I've shaved this yak, I can properly profile and analyze Cro applications and anything that runs tasks with the
start keyword, like the backend of the heap analyzer!
I hope to see you back on my blog when the next blog post hits the 'net!