Profiling
Core Technology
We all want our programs to run fast. With profiling, you can locate the code that is slowing you down and fix it.
While hardware becomes faster and faster, the software it runs becomes slower and slower. Although this is a joke, it carries a bit of truth. Quite often, the software uses suboptimal algorithms, introduces costly abstractions, or misses hardware caches. This is okay most of the time, because hardware is forgiving and has enough capacity for many tasks. The rest of the time, you want your code to perform as fast as possible.
Optimizing a program is not the same as rewriting it from scratch the right way. The majority of code is generally not in the "hot path": It's executed only once in a while during the program's lifetime, and investing optimization efforts into these parts will never pay off. So, you need a tool to tell you which code paths are consuming most of the CPU time.
A technique called "profiling" comes into play here. You build an execution profile of the program (hence the name) that tells you which functions the program runs and how much time they take; then, you look at the profile and decide what you can do about it. While the last step is a bit of an art, the first step just requires a tool that is essential to any developer. Luckily, Linux offers several such tools.
Ye Olde gprof
In a nutshell, any profiler records functions it sees on the CPU and how long they last. A naive approach would be to inject a hook at the beginning and the end of every function. An "entry hook" can remember the function's name and the timestamp, so the "exit hook" can calculate the time elapsed with a mere subtraction.
Simple as it sounds, this won't work in a multiprocess operating system like Linux, because the kernel may preempt the function being executed virtually anywhere, and the wall-clock difference between entry and exit times could be very different from the CPU time consumed. You might end up optimizing a function that was simply preempted by a higher-priority process. That's not good.
A real-world approach is to sample the program at predefined intervals of virtual execution time. Virtual execution time stops when the process is de-scheduled, so if you multiply the sample interval by the number of samples, you get an estimate of how long the function was running. You don't want to sample so often that it affects performance, and you don't want the interval to be unnecessarily large, because it is the natural upper bound of the estimation error. Many Unix systems introduce a dedicated profil(2) system call for such sampling. In Linux, it's a library function [1] that runs on top of setitimer(2), thus having more overhead than it should. But you don't call profil() directly. GCC, glibc, and friends wrap these nuts and bolts to build a high-level tool: gprof.
I have never seen a Linux distribution that provides a build-essential equivalent but no gprof, so it is safe to assume you'll find gprof in the repositories, typically within binutils. In fact, gprof is just a utility to postprocess and display collected profiles. Before you can use it, you need to build your code with profiling instrumentation enabled.
To do this, pass the -pg switch to GCC to add some instrumentation (see the "A Word on Function Instrumentation" box): an initialization function, __monstartup(), and the corresponding atexit handler [2], _mcleanup(). GCC also embeds a call to mcount() in every function in your program. __monstartup() sets things up and calls profil(); mcount records the call stack. For this reason, you can't use -fomit-frame-pointer together with -pg. When your program stops, the _mcleanup() function calls write_gmon(), which saves the call graph, the program counter histogram that profil() collects, and some other data in a file, typically named gmon.out.
A Word on Function Instrumentation
The -pg switch of gcc is an example of a technique known as "function instrumentation," and it has many more applications, really.
You know that C++ programs can throw exceptions. Other languages, such as Python, do this as well, but, when it happens, you get a neat backtrace that you can use to learn the cause of the problem. What if you could do a similar thing with C++?
Function instrumentation is the answer. When you pass -finstrument-functions to GCC, it embeds two calls in each function: __cyg_profile_func_enter() is called when the function is entered, and __cyg_profile_func_exit() runs on exit. Both accept a pointer to the function being called and a pointer to the site that called it. It is straightforward to use these two hooks to record the call chain. Then, if an exception happens, you can readily construct the backtrace, as in Python.
The best part is that you don't need to implement it yourself, because ready-to-use libraries take care of everything. For instance, try libcsdbg [4].
For postprocessing, you run gprof. The tool accepts many command-line switches [3], but generally you run it as

gprof foo gmon.out

where foo is the name of the binary you were profiling. This yields a lengthy output that looks similar to Figure 1. The output spans two main parts: the flat profile and the call graph. To show the flat profile only, run gprof -p; to display the call graph, run gprof -q. Both accept a symbol (e.g., a function name) to use as a filter. You can also make gprof less verbose with -b: This way, you get data only and no explanatory text (Figure 2).
The flat profile shows which functions consumed CPU time. The most important piece of information is in the self column, which tells how much time the function spent executing. The cumulative column is the running sum, so if you are interested in the notorious 20 percent of functions that run 80 percent of the time, just cut the output after 80-something in this column. You also get the number of calls for each function (calls) and the average number of milliseconds spent in the function itself (self) or together with its children (total) per call.
The call graph tells you how these functions are reached. main() should be the top function, accounting for 100 percent of CPU time, because this is the function your program executes at the top level. Below it should be the functions that main() calls, along with the time accounted to them (self) and to their children. It is possible that several paths lead to the same function, and it makes sense to optimize the path at the top of this list.
Profiling System-Wide
gprof is a venerable tool built on even older technology, and books describe it in detail. gprof is still relevant today, but there are other options as well.
The main pain point with gprof is that you need to instrument a program before you can profile it. Instrumented programs are generally slower, but that's not the point. The point is that you need to recompile your code, which makes gprof unsuitable for run-time analysis: You can't just attach to a process and see where it spends its time. You are also instrumenting just one process, so if the system as a whole is performing poorly, gprof won't tell you the reason. Finally, gprof doesn't help when it comes to kernel-space code.
This doesn't mean gprof is bad. It was designed to profile userspace code during development, and it handles this job quite well. For everything else, you just need a different tool. One such tool is perf [5], which comes from the Linux kernel team. perf relies on hardware support, such as performance counters, to monitor fairly tricky events. It will help you understand whether your code is underperforming (e.g., because it has a high L2 cache miss rate). But perf is also useful for "ordinary" profiling, and it can do it system-wide or per-process. It also profiles both in userspace and in kernel space, but there are some prerequisites. First, your code should be compiled with frame pointer support. This is likely the case, as I explain in the "Your Friendly Neighbor the Frame Pointer" box. perf claims to support programs built with -fomit-frame-pointer as well, but I have never succeeded in getting a decent profile with other methods, such as DWARF. Your mileage may vary. Second, you'll need debug symbols for the code you profile and the libraries it relies on (e.g., glibc). Luckily, many distributions ship debug symbols in separate packages these days, so you can add them on an as-needed basis.
The perf tool (actually, a set of tools) should be in your distribution's repository. It comes from the kernel sources and is naturally tied to the kernel version your Linux box runs. On Ubuntu, install linux-tools-X.Y.Z-generic, or a similar package if you are running a non-generic (e.g., lowlatency) kernel. Here, X.Y.Z is your kernel's version.
As I mentioned earlier, perf builds around performance events. There could be quite a few of them, depending on your hardware and kernel options, because some events (e.g., tracepoints or context switches) are software based. You can get a complete list of events supported on your system with sudo perf list. Note the use of sudo: perf list runs without superuser privileges, too, but the number of events it reports will be limited. Many perf subcommands understand the -e or --event switch, which tells perf which events you are interested in. It defaults to cycles, which counts CPU cycles and is a natural fit for a profiling task.
For system-wide performance analysis, perf top should be your first stop. It displays an ncurses-based, top-like interface (Figure 3). The entries with higher overhead (i.e., executed most often, unless you set otherwise with -e) come first. Each entry corresponds to a symbol – usually, a function. With debug symbols installed, you see the functions' names; otherwise, you get hexadecimal addresses. Humans are not good at reading hexadecimals, which is why it's important to have debug symbols installed. Note that the Linux kernel itself is also listed.
Each entry is attributed several properties, such as comm (the command being executed) and DSO. The latter stands for Dynamic Shared Object, and if your binary is a dynamic executable (many are), the DSO column tells you which library the symbol comes from. Note that the binary itself is also a DSO.
perf defaults to function-level granularity. For a higher level overview, you can sort the output by comm and then by DSO with:

sudo perf top --sort comm,dso

This way, you get process-level details.
Zoom into Processes
After you have learned which process is the culprit, it's time to drill down into it. The default function-level granularity works well here, but you can add "another dimension" with -g, which enables call graph collection. There are some ways to fine-tune this [6]. You get a new column, Children, which accounts for the time spent in the functions called from the highlighted entry, but the main effect is that you can zoom into functions by pressing Enter. While you can employ this to monitor individual processes in perf top (use the -p switch), there is another way.
The perf record command gathers the performance counter profile from a given command and saves it in a file, usually perf.data. Then you can postprocess this data the way you want: I'll show some options in a second. The command for gathering the profile can be specified on the command line, or you can again use -p to sample a given process or processes. You specify a sampling frequency with -F or --freq, and it is a good idea not to use "beautiful" values, such as 10 or 100, here. These numbers readily spring to mind when one just needs "some frequency," so the code you profile may already have something that ticks at exactly the same interval. Instead of profiling the code, you would be profiling that timer's callbacks, which is not what you want.
Given all of these conditions, a typical perf record invocation may look like:

sudo perf record -g -F 99 -a -p $PID -- sleep 10

The -a switch tells perf to sample across all CPUs, and -p tells perf which process you want to profile. The sleep command just sets the duration for which you want perf to collect samples; here, it's 10 seconds.
To format the collected output, you use perf report (Figure 4). It shows the same ncurses-based text user interface you saw in perf top. You can zoom in to the call graph with the Enter key, and other key bindings are available as well (press h for the full list). For instance, the Esc key zooms out, and pressing a annotates the symbol under the cursor: You get a disassembly of the function showing which instructions are most stressed.

The most popular profile viewer is probably perf report, but it's certainly not the only one. Hotspot [7], not to be confused with the Java VM, is a GUI counterpart that combines many features in a single Qt-based package (Figure 5), including built-in FlameGraphs (see the "Command of the Month" box) and more intuitive aggregation. Hotspot was released to the public while I was writing this column, so I didn't have much time to give it a try. If you did, please share your thoughts.
Command of the Month: flamegraph.pl
Coverage of perf would be rather incomplete without mentioning FlameGraphs.
FlameGraph is the call graph visualization that makes finding hot paths in the code easier. With an appropriate palette, these visualizations look much like flames (Figure 6), hence the name.
FlameGraphs are two-dimensional. Along the horizontal axis, stack population (that is, functions called) is shown. It is sorted alphabetically and doesn't indicate a passage of time. Each function (or, more precisely, call stack frame) is represented as a rectangle, and the wider the rectangle, the more often the function appears in the profile. The vertical axis represents the call sequence: Functions at the bottom call those at the top, so the largest flame is what is seen on the CPU the most.
Brendan D. Gregg, the inventor of FlameGraph, also built a toolset to create FlameGraphs from perf.data files (and many other sources). It's available online [8]; the main piece of this toolset, flamegraph.pl (yes, it's Perl), renders FlameGraphs as interactive SVG files. The README [8] provides all the details.
If you are serious about doing performance analysis in Linux, I highly recommend Brendan Gregg's other tutorials, blog posts, and tools available on his homepage [9].
Your Friendly Neighbor the Frame Pointer
You have already seen enough examples of how useful it can be to record the call chain that leads to a given point in the code. The frame pointer is a way to do this at the hardware level.

The frame pointer is just a CPU register. On x86, the BP/EBP/RBP registers serve this purpose. In code that uses frame pointers, each function typically begins with:
push %rbp
mov  %rsp, %rbp
That is, %rbp always points to the current function's stack frame, and the previous function's frame pointer is saved just above it (remember that on x86, the stack grows down toward lower addresses). When the function returns, it does the reverse:
mov  %rbp, %rsp
pop  %rbp
There are even dedicated instructions for this: enter and leave, respectively.
Frame pointers take one register to store (and a few CPU cycles to maintain), so historically, compilers have provided a way to disable this feature. Using that option is often discouraged now: Registers aren't as scarce as they were in the 32-bit world, and the ability to reconstruct the call stack usually outweighs the gains of this optimization.
Infos
[1] profil(3) man page: https://linux.die.net/man/3/profil
[2] atexit(3) man page: https://linux.die.net/man/3/atexit
[3] gprof(1) man page: https://linux.die.net/man/1/gprof
[4] libcsdbg: http://libcsdbg.sourceforge.net/
[5] perf wiki: https://perf.wiki.kernel.org/index.php/Main_Page
[6] perf-record(1) man page: https://linux.die.net/man/1/perf-record
[7] Hotspot homepage: https://www.kdab.com/hotspot-gui-linux-perf-profiler/
[8] FlameGraph homepage: https://github.com/brendangregg/FlameGraph
[9] Brendan Gregg's homepage: http://www.brendangregg.com/