November 10, 2018 · grant

Rakudo Perl 6 performance analysis tooling: Progress Report

I started properly working on the grant later than I had hoped, and I have not been able to invest as much time per week as I had initially planned. However, more and more parts of the tool have recently come together to form a useful product.

Core developer Stefan 'nine' Seifert has already been able to make good use of the tool to analyze and optimize the performance of a rewrite of MoarVM's bytecode compiler in NQP. Here are quotes from Stefan on IRC, along with links to the public IRC logs:

https://colabti.org/irclogger/irclogger_log/perl6-dev?date=2018-10-09#l56

I now managed to get a profile of the MAST stage using a patched --profile-stage, but the profile is too large even for the QT frontend :/
timotimo++ # moarperf works beautifully with this huge profile :)

https://colabti.org/irclogger/irclogger_log/perl6-dev?date=2018-10-23#l66

timotimo++ # the profiler is indespensable in getting nqp-mbc merge worthy.

Below you can find a copy of the "Deliverables and Inchstones" section of the original grant proposal, with explanations of the current status and progress, notes on any additional features I've been able to implement, and a couple of screenshots to illustrate what I mean.

The "web frontend for heap snapshot analyzer" and "bunch of examples" sections have had their sub-headings removed, because work on that area has not yet started.

Deliverables and Inchstones

A blog with progress reports.

There have been six reports; that they weren't more numerous is mostly due to a long starting delay and some additional intervention from Life Itself.

A web frontend for the heap snapshot analyzer

I have pushed this off to the future, opting to prioritize the instrumented profiler frontend instead.

A new web frontend for the instrumented profiler

  • Communication between the web frontend and a backend for data queries
    Some things are currently delivered via a "status message" websocket, others via GET queries. I might move a few of these from one mode to the other, for example to make large or multi-part results arrive in chunks, so that the UI can already display something before all of the data has arrived.
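    To make the idea concrete, here's a minimal sketch in the style of a Cro route block; the endpoint names and payloads are invented for illustration and are not moarperf's actual API:

      # Sketch only; routes, names, and payloads are made up for illustration.
      use Cro::HTTP::Router;
      use Cro::HTTP::Router::WebSocket;
      use JSON::Fast;

      my $routes = route {
          # Small, self-contained results fit a plain GET query ...
          get -> 'overview' {
              content 'application/json', %( total-time => 1234, gc-runs => 42 );
          }

          # ... while large or multi-part results could be pushed over a
          # websocket in chunks, so the UI can render partial data early.
          get -> 'routine-tree' {
              web-socket -> $incoming {
                  supply {
                      for ^10 -> $chunk-number {
                          emit to-json(%( chunk => $chunk-number, rows => [] ));
                      }
                  }
              }
          }
      }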

[Screenshot: Selection_096]

  • Overview page with at-a-glance metrics for quick feedback for code changes
    Just like the previous frontend, this page shows a whole bunch of high-level metrics about the program in general. These let you easily spot very big differences between two versions of the same code, or between the same code run by two different versions of Rakudo.
    The Threads section shows the total run time of the profiled code and how long the specializer was busy. On top of that, it displays the number of threads still active when profiling finished, and when each thread was started relative to the thread that initiated profiling.
    The GC section has the total time spent running the GC, how often it ran minor and major collections, and the minimum, maximum, and average time of minor vs major collections.
    The Call Frames section is equivalent to the old profiler's Call Frames part, except that it formats the numbers with separators for thousands and such, which makes it a lot more readable to my eyes.
    The Dynamic Optimization section is also equivalent to the old profiler's Dynamic Optimization section.

[Screenshot: Selection_090]

  • Call Graph explorer
    The Call Graph Explorer is implemented and gives access to all threads that have run code. Navigating forwards is possible via a table of callees; navigating backwards via a breadcrumbs display, a dedicated "up" button near the callee list, or the browser's back button and history. The browser's forward button works for going forwards again, too.
    • Robust forwards/backwards navigation
      (see above)
    • Icicle Graph with acceptable performance
      I have not yet started this.
    • Search function for Call Graph Explorer
      There is now a search field above the list of callees. Entering parts of a routine's name will put a little right-pointing hand icon in front of the callees that directly or indirectly call a routine with that name.

[Screenshot: Selection_091]

  • Routines List
    The Routines List ("overview") is implemented and offers a good range of information on routines seen throughout the profile. To keep the browser from becoming too laggy, it initially loads only the first 100 entries, along with a button that lets the user request 100 more.
    • Sortable by every sensible column
      Some sorting options are still missing: sorting by "inclusive time per call" or "exclusive time per call" would probably be the most interesting, followed perhaps by "percentage jitted".
    • Groupable by filename
      I have not yet started on this. It's a low priority at the moment, unless someone asks to get it sooner.
    • Expand a routine to see its callers or callees
      Expanding a routine offers tabs with callers, callees, allocations, and "paths" (see the next point).

[Screenshot: Selection_092]

  • … with links directly to the call graph explorer
    Since routines in the routine list encompass any number of different places in the call graph, having just one link to the call graph explorer doesn't make sense. Instead, there is a tab that shows all paths from thread entries to the given routine. All of the entries in these paths are direct links into the call graph explorer.
  • Bonus Features delivered:
    Clicking on a little icon next to a routine's name will jump to and highlight the routine in the routine list itself.

[Screenshot: Selection_093]

  • Allocation explorer
    The Allocation Explorer is implemented. It's very simple, but definitely useful. On top of a list of all routines that allocate any given type, it gives you immediate access to a "paths" view like the one in the Routines List. In the future, it will also point out how many allocations came from which parts of the tree.
    There's nothing yet to graphically show how much the types are "worth" in relation to each other, but compared to the previous profiler it shows not only the number of allocations, but also the size of each object.
    • Show what routines allocate a given class
      Done. Includes links directly to the routine overview list.
    • Expand a routine to see which of its callers is responsible for what amount of allocations
      This feature is invaluable for situations where the allocating routine for a type is always just "method new". You can already get more detailed information using the "paths" view; soon it will also tell you which paths are responsible for what fraction of total allocations. (A toy example of this situation follows below, after this list.)
    • Link to the heap snapshot explorer for any given type
      Since the heap snapshot explorer is going to be built in the second part of the grant, this is not yet applicable.
    • Bonus Features delivered:
      • Display type sizes
        The display of types has been enhanced to show not only the total number of allocations, but also the size per object, the total size, and a hint for types that carry varying amounts of extra data per object, such as arrays or hashes.
      • Link to Docs
        Types that are from the core setting and don't look like internal classes have links to the documentation website next to them.
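    To illustrate the "everything is allocated in method new" situation mentioned above, here's a purely made-up toy example; profiling it attributes all Point allocations to the constructor, and only the callers or "paths" view reveals which of the two loops is actually responsible:

      # Purely illustrative; not code from moarperf or its test suite.
      class Point {
          has $.x;
          has $.y;
      }

      # Both helpers create Points exclusively through Point.new, so a plain
      # "who allocates Point?" list pins everything on the constructor.
      sub straight-line($n) { (^$n).map({ Point.new(x => $_,      y => $_     ) }).elems }
      sub scattered($n)     { (^$n).map({ Point.new(x => $_.rand, y => $_.rand) }).elems }

      say straight-line(100_000) + scattered(100_000);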
  • OSR and Deopt explorer
    This part is currently completely missing.
    • Show routines that have been OSR'd (On-Stack-Replaced by the optimizer)
    • Again, allow expanding routines to see their callers
    • Expose a lot more information about deopts - which types are involved, for example
    • Explain in clear terms what OSR and Deopt mean

[Screenshot: Selection_094]

  • GC explorer
    This part has been the most "interesting" to work on: changes to MoarVM had to be made, and a concept for displaying GC behavior across multiple threads was needed.
    • Figure out a good way to present information from multi-threaded programs here
      Currently, every entry in the "GC runs" table shows which threads (by ID) participated in the collection, and expanding a row gives you timing information for each individual thread: how long the other threads took to join in, and how long each one took. (A small illustrative snippet follows at the end of this section.)
    • Expose time-of-collection info to profiler
      Every row in the GC table has a "since start" and "since previous" entry.

[Screenshot: Selection_095]

    • Show the time and duration of collections, how much data was kept, promoted to the old generation, or deleted, and whether the collection was a minor or major one
      This is exposed in the expanded GC table rows.
    • Filterable to show major collections only
      The user can choose between "major only", "minor and major", or "minor only".
    • Compactly graph collection times and kept/promoted/deleted amounts
      There are bar charts at the very top for "time taken" and "time since start", and a choice of "combined", "split", and "relativized" graphs for the amounts of nursery data promoted, kept, and freed in each GC run.
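    For a feel of what the multi-threaded view ends up displaying, here's a purely illustrative snippet; running something like it under the profiler should produce GC runs in which several thread IDs participate:

      # Purely illustrative: four worker threads allocating at the same time,
      # so the profile's GC runs should show multiple participating threads.
      my @workers = do for ^4 -> $worker-id {
          start {
              my @junk;
              for ^200_000 {
                  @junk.push: [ $worker-id, $_, "short-lived" ];
                  @junk = () if @junk.elems > 10_000;   # let most of it become garbage
              }
          }
      }
      await @workers;
      say "workers done";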

User-facing documentation on using the profiling facilities and interpreting the data

This has not happened yet, apart from the release blog post that explains how to run the profiler, and the README.
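For completeness, the short version of "how to run the profiler" goes roughly like this; the script below is just a made-up stand-in, and the README has the exact, current invocation:

    # count-squares.p6: a stand-in script; name and contents are made up.
    my @squares = (^100_000).map(* ** 2);
    say @squares.sum;

    # With a reasonably recent Rakudo, something along these lines produces
    # an SQL dump that the moarperf frontend can load:
    #
    #     perl6 --profile=profile.sql count-squares.p6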

Build a bunch of examples that show different performance issues and describe how to figure them out

There are currently big changes going on in MoarVM's optimizer related to removing unnecessary boxing of intermediate values in calculations and such, which may make any examples moot very soon. I've delayed this part of the work a bit for that reason.
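To give an idea of the kind of code involved, here's a purely illustrative hot integer loop; whether its intermediate sums stay in native registers or get boxed into Int objects on every iteration depends on what the optimizer manages to prove, so allocation numbers for code like this can change a lot between MoarVM revisions:

    # Purely illustrative: a hot integer loop of the kind those optimizer
    # changes target.
    sub triangle(int $n) {
        my int $sum = 0;
        for ^$n -> int $i {
            $sum = $sum + $i;       # this intermediate may or may not get boxed
        }
        $sum
    }
    say triangle(10_000_000);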

In Summary ...

I was positively surprised when I last opened the original tasks list and saw what was already done, and what could be finished with just a little bit of work!

I hope you'll give moarperf a try some time soon and let me know what you think! Here's a link to the project on GitHub: MoarPerf project on GitHub

Thanks for staying around and reading my blog posts :)
  - Timo
