Instrumented Profiler

Always Wear Safety Equipment When Inline Scalaring

Allocating objects is expensive. The shorter an object's lifetime is, the less expensive it is in total. But what happens to objects that only live for a few instructions? This post shows off how "scalar replacement" looks in the profiler frontend.

MoarVM just recently got a nice optimization merged into the master branch. It's called "partial escape analysis" and comes with a specific optimization named "scalar replacement". In simple terms, it allows objects whose lifetime can be proven to end very soon to be created on the stack instead of in the garbage collector managed heap. More than just that, the "partial" aspect of the escape analysis even allows that when the object can escape out of our grasp, but will not always do so.

Allocation on the stack is much cheaper than allocation in the garbage collected heap, because freeing data off of the stack is as easy as leaving the function behind.

person wearing pair of green inline skates
Photo by A. Zuhri / Unsplash

Like a big portion of optimizations in MoarVM's run-time optimizer/specializer ("spesh"), this analysis and optimization usually relies on some prior inlining of code. That's where the pun in the post's title comes from.

This is a progress report for the profiler front-end, though. So the question I'd like to answer in this post is how the programmer would check if these optimizations are being utilized in their own code.

A Full Overview

The first thing the user gets to see in the profiler's frontend is a big overview page summarizing lots of results from all aspects of the program's run. Thread creations, garbage collector runs, frame allocations made unnecessary by inlining, etc.

That page just got a new entry on it, that sums up all allocations and also shows how many allocations have been made unnecessary by scalar replacement, along with a percentage. Here's an example screenshot:

Selection_150

That on its own is really only interesting if you've run a program twice and maybe changed the code in between to compare how allocation/optimization behavior changed in between. However, there's also more detail to be found on the Allocations page:

Per-Type and Per-Routine Allocations/Replacements

The "Allocations" tab already gave details on which types were allocated most, and which routines did most of the allocating for any given type. Now there is an additional column that gives the number of scalar replaced objects as well:

Selection_151

Here's a screenshot showing the details of the Num type expanded to display the routines:

Selection_152

Spesh Performance Overview

One major part of Spesh is its "speculative" optimizations. They have the benefit of allowing optimizations even when something can't be proven. When some assumption is broken, a deoptimization happens, which is effectively a jump from inside an optimized bit of code back to the same position in the unoptimized code. There's also "on-stack-replacement", which is effectively a jump from inside unoptimized code to the same position in optimized code. The details are, of course, more complicated than that.

How can you find out which routines in your program (in libraries you're calling, or the "core setting" of all builtin classes and routines) are affected by deoptimization or by OSR? There is now an extra tab in the profiler frontend that gives you the numbers:

Selection_153

This page also has the first big attempt at putting hopefully helpful text directly next to the data. Below the table there's these sections:

Specializer Performance

MoarVM comes with a dynamic code optimizer called "spesh". It makes your code faster by observing at run time which types are used where, which methods end up being called in certain situations where there are multiple potential candidates, and so on. This is called specialization, because it creates versions of the code that take shortcuts based on assumptions it made from the observed data.

Deoptimization

Assumptions, however, are there to be broken. Sometimes the optimized and specialized code finds that an assumption no longer holds. Parts of the specialized code that detect this are called "guards". When a guard detects a mismatch, the running code has to be switched from the optimized code back to the unoptimized code. This is called a "deoptimization", or "deopt" for short.

Deopts are a natural part of a program's life, and at low numbers they usually aren't a problem. For example, code that reads data from a file would read from a buffer most of the time, but at some point the buffer would be exhausted and new data would have to be fetched from the filesystem. This could mean a deopt.

If, however, the profiler points out a large amount of deopts, there could be an optimization opportunity.

On-Stack Replacement (OSR)

Regular optimization activates when a function is entered, but programs often have loops that run for a long time until the containing function is entered again.

On-Stack Replacement is used to handle cases like this. Every round of the loop in the unoptimized code will check if an optimized version can be entered. This has the additional effect that a deoptimization in such code can quickly lead back into optimized code.

Situations like these can cause high numbers of deopts along with high numbers of OSRs.

I'd be happy to hear your thoughts on how clear this text is to you, especially if you're not very knowledgeable about the topics discussed. Check github for the current version of this text - it should be at https://github.com/timo/moarperf/blob/master/frontend/profiler/components/SpeshOverview.jsx#L82 - and submit an issue to the github repo, or find me on IRC, on the perl6-users mailing list, on reddit, on mastodon/the fediverse, twitter, or where-ever.