
Profile-guided optimization for the kernel


By Jonathan Corbet
September 3, 2020
LPC
One of the many unfortunate consequences of the Covid-19 pandemic was the cancellation of the 2020 GNU Tools Cauldron. That loss turned out to be a gain for the Linux Plumbers Conference, which was able to add a GNU Tools track to host many of the discussions that would have otherwise occurred at Cauldron. In that track, Ian Bearman presented his group's work using profile-guided optimization with the Linux kernel. This technique, which he often referred to as "pogo", is not straightforward to apply to the kernel, but the benefits would appear to justify the effort.

Bearman is the leader of Microsoft's GNU/Linux development-tools team, which is charged with supporting those tools for the rest of the company. The team's responsibilities include ensuring the correctness, performance, and security of those tools (and the programs generated by them). Once upon a time, the idea of Microsoft having a GNU tools team would have raised eyebrows. Now, he said, about half of the instances in the Microsoft cloud are running Linux, making Linux a big deal for the company; it is thus not surprising that the company's cloud group is his team's biggest customer.

There was recently, he said, an internal customer working on a new Linux-based service that asked his team for performance help. After some brainstorming, the group concluded that this would be a good opportunity to use profile-guided optimization; the customer would have control of the whole machine running the service and was willing to build a custom kernel, making it possible to chase performance gains at any level of the system. But there was a small problem in that the customer was unable to provide any code to allow workload-specific testing.

Optimization techniques

From there, Bearman detoured into a "primer" on a pair of advanced optimization techniques. Profile-guided optimization is a technique where the compiler can be told how to optimize a program based on observations of its run-time performance. In any given program, not all parts will be executed as often as others; some parts, indeed, may not be run at all. Using profile data, the compiler can separate off the rarely used code and optimize its compilation for space. Hot code, meanwhile, can be fully optimized and allowed to take more space in the process. Hot code and data can be laid out next to each other in the address space. The result is better performance overall, greater locality, better use of the translation lookaside buffer (TLB), and less disk I/O for paging.

Link-time optimization is a different technique. Normally, the compiler only sees one file at a time, and is thus only able to optimize code within that one file. The linker then assembles the results of multiple compilation steps into the final program. Link-time optimization works by allowing the compiler to process the entire program at once, delaying the optimization and code-generation steps until all of the pieces are available. The result can be significant performance improvements.

The two techniques can work together, he said, with impressive results. Some work to optimize a SPEC benchmark yielded a 5% performance gain with link-time optimization, and a 20% gain when profile-guided optimization was added as well. That is fine for a standalone application, but Bearman wanted to apply these techniques to the kernel, where few have dared to tread. Some digging around turned up two papers published by Pengfei Yuan (and collaborators) in 2014 and 2015; the latter claimed an average speedup of 8%. So the technique seemed worth a try.

Optimizing the kernel

The work was done on an Ubuntu 19.10 system, using the toolchain shipped with the distribution. Link-time optimization was not entirely straightforward to set up; in the end, some assistance from Andi Kleen, who has been working on kernel link-time optimization for years, was necessary. Getting profile-guided optimization working, by contrast, was relatively easy, requiring just some trial-and-error work.

The team proceeded by instrumenting the kernel, then running it with various workloads of interest. The kernel supports profiling with gcov; that provided much of the information that was needed. He cautioned that anybody wanting to repeat this work should take care to turn off the profiling options and rebuild the kernel after the data has been gathered, or the result will not be as optimized as one might like — words that sounded like the voice of experience. Getting the profile data into the compiler was a bit challenging; GCC expects it to be in a specific location with a complicated name. One file crashed the compiler, and various "other glitches" were encountered as well.
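The kernel's gcov support is enabled at build time; the relevant options, documented in the kernel's gcov documentation, amount to a small config fragment along these lines (the exact set may vary by kernel version):

```
CONFIG_DEBUG_FS=y
CONFIG_GCOV_KERNEL=y
CONFIG_GCOV_PROFILE_ALL=y
```

Coverage data then appears under /sys/kernel/debug/gcov/ at run time; as Bearman cautioned, these options must be switched off again for the final, optimized build.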

Much of the testing was done with the Redis database on a 5.3 kernel. Just building the kernel with the -O3 option turned out to make performance worse. The kernel that had been optimized with link-time and profile-guided optimization, though, outperformed the standard kernel by 2-4% on all but one test (that last test regressed performance by about 0.5%). This, he said, was an impressive performance gain, especially since Redis doesn't actually spend all that much time in the kernel.

Bearman's conclusion from this work is that use of these techniques with the kernel is worth the trouble. Windows relies heavily on profile-guided optimization with its kernel, he said, and gets 5-20% performance improvements in return. Linux could perhaps get results of this magnitude as well. There is a "cyclic dependency" that is inhibiting the use of these tools with the Linux kernel; profile-guided optimization is not being heavily used, so people don't see the value in it. That results in compiler developers not putting in the effort to make it work better, so it remains unused. If more developers were to put in the effort to apply profile-guided optimization, perhaps that cycle could be reversed to the benefit of the entire community.

More information can be found in the slides from this presentation [PDF].

Index entries for this article
Kernel: Optimization tools
Conference: Linux Plumbers Conference/2020



Profile-guided optimization for the kernel

Posted Sep 4, 2020 5:06 UTC (Fri) by rbrito (guest, #66188) [Link]

I am eagerly waiting for kernels to be regularly compiled with link-time optimization at least, since the prospects of it generating smaller binaries could help a lot with booting more armv5 machines (the space for the kernel file is limited when loading it with the bootloader).

If the kernel also happens to use fewer bytes in memory, that is a very nice side effect: memory that could be left to userspace on memory-constrained machines, so much the better. In fact, this would also help with the kernels that phones use (I'm thinking of lineageos here). A leaner kernel would mean less thrashing, which can only be a good thing. I hope that apps that use the NDK can also switch to LTO in the relatively near future.

Having PGO enabled in distribution kernels for regular computers also seems very nice. I really, really hope that we're not far from that being the bread-and-butter of kernel compilation.

Also, talking about compiler optimizations: in the last few days I read about a new approach that was just merged into LLVM (see https://github.com/llvm/llvm-project/commit/94faadac) called the machine function splitter. It seems to be roughly based on the observation that not all hot functions have all their parts hot. It would be lovely to get something similar in GCC.

In fact, if you think about it, many of the optimizations (like reducing the code for little machines like armv5) may also help significantly the code for big, compute-intensive, cloud applications and, so, everyone would benefit from that...

Profile-guided optimization for the kernel

Posted Sep 4, 2020 22:09 UTC (Fri) by nivedita76 (guest, #121790) [Link]

Don't know any details, but the LLVM pass sounds similar to gcc's -freorder-blocks-and-partition.

In addition to reordering basic blocks within a compiled function to reduce the number of taken branches, it partitions hot and cold basic blocks into separate sections of the assembly and .o files, to improve paging and cache locality.

Profile-guided optimization for the kernel

Posted Sep 10, 2020 17:49 UTC (Thu) by ndesaulniers (subscriber, #110768) [Link]

We've reached out to Ian about collaborating on LTO integration; we're in the process of upstreaming LTO patches for LLVM and need help from the GCC side to test and sort out what issues they might run into: https://lore.kernel.org/lkml/CAKwvOdkbkvXdfXLzTNOj8m8_YWj...

For example, LTO has exposed a handful of compiler bugs. Most recently: https://bugs.llvm.org/show_bug.cgi?id=47479.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds