Making FlameGraphs with Containerized Java
About a month ago, I had the pleasure of taking a tutorial led by the fantastic Brendan Gregg on creating FlameGraphs using the Linux perf toolset. I recommend reading his many blog posts on the subject, but in short: while perf is an excellent resource for debugging kernel and user space processes, FlameGraphs make the data even easier to consume.
Now, if the process you’re trying to profile is Java, there are some extra hoops to jump through, which Brendan has also detailed online.
But if the Java process is in a container, it’s even more annoying. That’s where this post comes in.
Some context
As Brendan explains in his blog post on the subject, perf doesn’t work out of the box on Java, because Java doesn’t automatically expose stacks and method names. Running perf without these gives you a FlameGraph in which the Java work collapses into a nondescript frame labeled “java”. Not very helpful.
Running Java with the option -XX:+PreserveFramePointer (available starting in JDK 8u60) exposes the stacks. However, without the method name symbols, the frames in those stacks still can’t be labeled with Java method names.
You also need to collect and dump the symbols of the running Java process, so perf can apply them to the correct stacks. Johannes Rudolph’s perf-map-agent repo makes this easier. It has scripts that dump the Java process symbols and even integrate with the FlameGraph repo to make the graphs for you with one command. It’s pretty slick.
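To see what “applying symbols to stacks” means in practice: perf looks for a plain-text file named /tmp/perf-PID.map. Each line holds a start address and a size (both in hex), followed by a symbol name. The Python sketch below is not part of either repo. It parses such a file and resolves an address the way a profiler would. The function names and sample entries are made up for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (start address, size in bytes, symbol name)
Symbol = Tuple[int, int, str]


def parse_perf_map(text: str) -> List[Symbol]:
    """Parse perf map text: one 'START SIZE name' entry per line, START/SIZE in hex."""
    symbols: List[Symbol] = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)  # the name itself may contain spaces
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, size, name = parts
        symbols.append((int(start, 16), int(size, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols: List[Symbol], addr: int) -> Optional[str]:
    """Return the name whose [start, start + size) range contains addr, if any."""
    starts = [s[0] for s in symbols]
    i = bisect_right(starts, addr) - 1  # last entry starting at or before addr
    if i >= 0:
        start, size, name = symbols[i]
        if addr < start + size:
            return name
    return None
```

perf does this lookup for you. The point is that the lookup is keyed by the file name’s PID, so the map only helps if its name matches the PID perf recorded.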
Enter containers.
Containers
Containers, for all their hype and mystery, are still processes on a host. Run ps on the host and you can see container processes listed right alongside noncontainerized ones.
In my case, the Java process I wanted to profile runs inside a Docker container. From the host’s point of view, it has PID 88834 and UID 103.
Inside the container, that same Java process has PID 27 and is owned by the cassandra user.
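You can see both views of the process from the host without entering the container. On reasonably recent kernels (the field was added in Linux 4.1, as far as I know), /proc/PID/status has an NSpid line. It lists the process’s PID in each nested PID namespace, outermost first. The Uid line’s first field is the real UID. Here is a minimal Python sketch that pulls these out; the helper names are hypothetical.

```python
from typing import Dict, List


def parse_proc_status(text: str) -> Dict[str, List[str]]:
    """Split each 'Key:<tab>values...' line of /proc/PID/status into fields."""
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.split()
    return fields


def pid_views(status_text: str) -> Dict[str, int]:
    """Return the host PID, innermost (container) PID, and real UID of a process."""
    fields = parse_proc_status(status_text)
    nspids = [int(p) for p in fields.get("NSpid", [])]
    if not nspids:
        raise ValueError("no NSpid line; kernel may be too old")
    return {
        "host_pid": nspids[0],         # PID in the namespace of this /proc mount
        "container_pid": nspids[-1],   # PID inside the innermost namespace
        "uid": int(fields["Uid"][0]),  # real UID, numeric only (no user name)
    }


def read_status(host_pid: int) -> str:
    """Read /proc/<host_pid>/status on the host (Linux only)."""
    with open(f"/proc/{host_pid}/status") as fh:
        return fh.read()


if __name__ == "__main__":
    sample = "Name:\tjava\nUid:\t103\t103\t103\t103\nNSpid:\t88834\t27\n"
    print(pid_views(sample))  # {'host_pid': 88834, 'container_pid': 27, 'uid': 103}
```

On a live host you would call pid_views(read_status(88834)). The UID comes back as a bare number: the host only knows 103, not the name cassandra.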
Herein lies the issue. Due to a bug in Java, you must dump the process symbols while operating as the owner of the Java process, and the perf-map-agent scripts require it. But the process owner (cassandra) only exists within the container. Meanwhile, the perf toolkit must be run as root, and it’s common practice not to allow root within running containers.
So, how can you dump the symbols?
The hack
The hack (“workaround” is too elegant a word) is to run perf on the host, dump the symbols inside the container, and then bring the two resulting files together in one place to make a FlameGraph.
More specifically:
- Set up the FlameGraph repo on your host and the perf-map-agent repo inside the container, where the Java process owner can access it. I also had to alter /etc/passwd inside the container to give my cassandra user a shell (use vipw for safety).
- Capture a system profile on the host with something like
sudo perf record -F 99 -a -g -- sleep 30
- From inside the container (it’s easiest to have a shell open there already), dump the symbols for the Java process with the following command, replacing PID with the process ID as seen inside the container (27 in my case):
java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
net.virtualvoid.perf.AttachOnce PID
- You will now have a perf-PID.map file in the container’s /tmp directory. Move this file to the host (I used a mounted volume).
- On the host, rename the perf-PID.map file to match the PID of the Java process as seen by the host. For example, my file was named perf-27.map, but the host sees that process as PID 88834, so I renamed it to perf-88834.map.
- Move the renamed perf-PID.map file to your host’s /tmp directory and chown it to root.
- You can now proceed with the directions as though containers were not involved. Create a FlameGraph with
sudo perf script | stackcollapse-perf.pl | flamegraph.pl \
--color=java --hash > flamegraph.svg
You will need to adjust this command depending on where your perf.data file and the FlameGraph scripts live.
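The rename-and-move steps are easy to script. The minimal Python sketch below assumes you already know both PIDs (for example from the NSpid line of /proc/HOSTPID/status). It also assumes the map file is already reachable from the host. As an alternative to a mounted volume, the host can usually read the container’s filesystem through /proc/HOSTPID/root. Treat that path as an assumption to verify on your setup. Both function names are hypothetical.

```python
import os
import shutil


def container_map_path(host_pid: int, container_pid: int) -> str:
    """Host-side path to the container's perf map via /proc/<host_pid>/root (assumed)."""
    return f"/proc/{host_pid}/root/tmp/perf-{container_pid}.map"


def install_perf_map(src: str, host_pid: int, dest_dir: str = "/tmp") -> str:
    """Copy a container-side perf map to <dest_dir>/perf-<host_pid>.map.

    When run as root, the copy is chowned to root:root, because perf may
    ignore map files not owned by the current user or by root.
    """
    dest = os.path.join(dest_dir, f"perf-{host_pid}.map")
    shutil.copyfile(src, dest)
    if os.geteuid() == 0:
        os.chown(dest, 0, 0)
    return dest
```

With the PIDs from this post, running as root on the host: install_perf_map(container_map_path(88834, 27), 88834). Copying rather than moving leaves the container’s original in place in case you need to redo the step.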
Voila! A containerized Java FlameGraph.
Tips:
- Let the JVM warm up before profiling so there is less churn in symbol creation (fewer methods still being JIT-compiled). I let mine run for 15 minutes.
- Run the perf profile before dumping the symbols. Reversing the order might result in empty stacks, because any symbols the JVM creates after the perf-PID.map file is written won’t be in it.
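A quick way to catch the ordering mistake from the last tip is to compare file modification times before running perf script. This is a small Python sketch, assuming you pass in the paths of perf.data and the renamed map. A newer mtime only shows the map was written after the recording finished; it can’t prove every method is covered.

```python
import os


def map_written_after_profile(perf_data: str, perf_map: str) -> bool:
    """True if the symbol map was last modified at or after perf.data.

    perf.data is finalized when `perf record` exits, so a map written
    earlier may be missing methods the JIT compiled during the recording.
    """
    return os.path.getmtime(perf_map) >= os.path.getmtime(perf_data)
```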
Why?
Why is this hack needed? Why can’t we dump the symbols outside the container?
At first glance, it seems easy enough to create a cassandra user on the host with UID 103. But trying to dump the Java symbols as that user fails with an error.
This is the same behavior you get if you try to dump the symbols as a user who doesn’t own the Java process. So, the host’s cassandra user can’t attach to a socket. What kind of socket? JMX or UNIX? Not sure. The documentation isn’t super clear.
Even nsenter fails here.
Entering the same network namespace as process 88834 still doesn’t give access to the socket.
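One hypothesis worth checking, based on my understanding of HotSpot’s Linux attach mechanism rather than anything confirmed above. The JVM listens on a UNIX domain socket named .java_pid<pid> in its temp directory (normally /tmp), where <pid> is the PID the JVM sees inside its own namespace. It typically creates that socket only after the first attach request. It also rejects peers whose effective UID and GID don’t match its own. If that holds, a host-side attacher looks for /tmp/.java_pid88834 in the host’s filesystem, while the socket actually lives at /tmp/.java_pid27 in the container’s. That would explain why joining only the network namespace doesn’t help. This Python sketch, with hypothetical helper names, checks for the socket through /proc/HOSTPID/root.

```python
import os
import stat


def attach_socket_path(host_pid: int, container_pid: int, tmpdir: str = "/tmp") -> str:
    """Host-side path to the JVM attach socket inside the container's filesystem.

    Assumes HotSpot's naming (.java_pid<pid>), with <pid> as seen in the container.
    """
    return os.path.join(f"/proc/{host_pid}/root", tmpdir.lstrip("/"), f".java_pid{container_pid}")


def has_attach_socket(path: str) -> bool:
    """True if path exists and is a UNIX domain socket (requires permission to stat it)."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False
```

For the process in this post, you would check has_attach_socket(attach_socket_path(88834, 27)) as root on the host, after the in-container symbol dump has run at least once.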
I talked to several people about this and each conversation ended in puzzlement. Usually I would only post once I had all the answers, but I think it’s good to illustrate that everyone gets stuck sometimes. And it’s better to get the hack out there as a stopgap in the meantime, clunky though it might be. I look forward to a more elegant solution.
Special thanks
I want to thank Brendan Gregg, Johannes Rudolph, and Nitsan Wakart for creating and maintaining the FlameGraph and perf-map-agent repos, as well as helping me initially troubleshoot. Thank you to Jérôme Petazzoni for his unique container systems knowledge and my colleague Mike Hix for poking at namespaces. I am proud to work with all of you and delighted to occasionally stump you.