eBPF to monitor heap usage of Java app
At the beginning of December 2024 I stumbled upon an interesting tutorial related to eBPF (Extended Berkeley Packet Filter) - Capturing User-Space Java GC Duration Using USDT. The non-invasive approach with eBPF captured my attention for two important reasons:
It allows getting observability of a running app without requiring any changes in the app:
Does not require adding Java agents;
Does not require enabling JFR.
The running app is completely unaware that it is monitored.
This potentially allows basic observability of JVM-based apps to be provided at the platform level instead of being the responsibility of each developer (to be clear, this isn't a replacement for Java Flight Recorder (JFR) and/or Java agents like the OpenTelemetry Java Agent).
It was super exciting and fun for me to dive into an unknown area, and I'm happy to share it with you. Hope you find this interesting as well! The approach is obviously "standing on the shoulders of giants", so to speak, and I got inspiration from:
eunomia-bpf project with their eBPF tutorials;
BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more;
bpftrace - High-level tracing language for Linux;
What is an eBPF program
An eBPF program is lightweight, custom code executed within the Linux kernel in a sandboxed environment. It attaches to specific hook points in the kernel or in user-space applications, enabling real-time monitoring, tracing, and modification of their behavior without requiring changes to the application code. There are many eBPF program types available and we're going to use the following three:
tracepoint for pre-defined trace points in the Linux Kernel;
uprobe for probing (intercepting and monitoring) userspace entry or exit points;
User-space Statically Defined Tracepoint (USDT) for probing built-in tracepoints in JVM.
We will need bpfcc-tools to help us quickly iterate on ideas, so I'd suggest installing it if you want to try this out.
Prototyping with bpfcc-tools
GC-related USDTs in JVM
DTrace probes have been available since JDK 6.0 and all subsequent versions have DTrace support; however, DTrace is disabled by default for OpenJDK builds, more in Notes.
The official documentation on DTrace Probes in HotSpot VM contains important details of the probes. The probes (for Java >= 11) are defined in jdk/src/hotspot/os/posix/dtrace/hotspot.d. We are interested in two GC-related probes:
HOTSPOT_MEM_POOL_GC_BEGIN, a probe that fires when an individual memory pool is about to be collected, invoked from jdk/src/hotspot/share/services/memoryManager.cpp#L230
HOTSPOT_MEM_POOL_GC_END, a probe that fires after an individual memory pool has been collected, invoked from jdk/src/hotspot/share/services/memoryManager.cpp#L263
Both accept the following arguments:
args[0] - A pointer to UTF-8 string data which contains the name of the manager that manages this memory pool
args[1] - The length of the manager name data (in bytes).
args[2] - A pointer to UTF-8 string data which contains the name of the memory pool.
args[3] - The length of the memory pool name data (in bytes).
args[4] - The initial size of the memory pool (in bytes).
args[5] - The amount of memory in use in the memory pool (in bytes).
args[6] - The number of committed pages in the memory pool.
args[7] - The maximum size of the memory pool.
I'm going to use this simple Java app as the target Java app to monitor heap usage.
Let's compile and run it in a separate terminal: javac InfiniteApp.java && java -Xmx500M InfiniteApp
In a new terminal, let's run bpftrace and attach to the USDT probe mem__pool__gc__begin of that Java process with pid 14435:
sudo bpftrace -p 14435 \
-e 'usdt:/home/user/.sdkman/candidates/java/21.0.5-zulu/lib/server/libjvm.so:hotspot:mem__pool__gc__begin
{
printf("manager: %s, pool: %s, init: %d bytes, used: %d bytes\n", str(arg0), str(arg2), arg4, arg5);
}'
To attach to the USDT probe mem__pool__gc__end we can modify the command slightly:
sudo bpftrace -p 14435 \
-e 'usdt:/home/user/.sdkman/candidates/java/21.0.5-zulu/lib/server/libjvm.so:hotspot:mem__pool__gc__end
{
printf("manager: %s, pool: %s, init: %d bytes, used: %d bytes\n", str(arg0), str(arg2), arg4, arg5);
}'
Tracepoint to detect app exits
We will need a way to detect when the monitored app exits. It can be done from user space using kill -0, or from kernel space by attaching to the sched/sched_process_exit tracepoint.
sudo bpftrace -p 32613 \
-e 'tracepoint:sched:sched_process_exit
{
// Use the correct Java process id!
if (pid != 32613) { return 0; }
$task = (struct task_struct *)curtask;
$exit_code = $task->exit_code >> 8;
printf("pid: %d, tid: %d, command: %s, exit_code: %d\n", pid, tid, comm, $exit_code);
}'
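For comparison, here is a minimal user-space sketch of the kill -0 approach mentioned above (my own illustration, not code from jheapusage): kill with signal 0 delivers nothing and only performs the existence and permission checks, so ESRCH tells us the process is gone.

// liveness.cpp (sketch) - poll whether a process still exists using kill(pid, 0).
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <signal.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    for (;;) {
        // Signal 0 sends nothing; it only checks that the target process exists.
        if (kill(pid, 0) == -1 && errno == ESRCH) {
            printf("pid %d has exited\n", pid);
            return 0;
        }
        sleep(1); // polling once per second - exactly what the tracepoint approach avoids
    }
}

The downside is visible immediately: we have to keep polling, while the tracepoint fires exactly once, at exit.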
Uprobe on internal JVM tracing
As I mentioned in Notes, OpenJDK builds have DTrace disabled. The approach with USDT won't work unless we build OpenJDK with DTrace enabled. Is there a way to get GC events without DTrace? After reading the OpenJDK source code I came up with the following idea:
JVM internally reports the state of the heap right before GC and immediately after GC
The reporting is done by calling the public method GCTracer::report_gc_heap_summary
Assuming we can find that method in libjvm.so, we can put a hook (uprobe) on it and receive the arguments passed to it. We also need to understand the memory layout of the C++ class GCHeapSummary to be able to read it from plain C in eBPF.
Finding report_gc_heap_summary
We can use readelf -Ws together with grep to check whether a symbol exists in a shared library, for example readelf -Ws libjvm.so | grep report_gc_heap_summary (the path to libjvm.so will differ on your machine). I checked different versions and builds of Java and all of them contain that symbol. Because it is a C++ method, the name is mangled: _ZNK8GCTracer22report_gc_heap_summaryEN6GCWhen4TypeERK13GCHeapSummary
As you can notice, the found symbol has local binding and hidden visibility. Obviously it is a bad idea to rely on hidden functions: it compromises stability and maintainability and introduces compatibility risks (even though report_gc_heap_summary is a public method of GCTracer, developers can change its signature, modify its argument types, or even remove it). But let's try to hook into it just for educational purposes.
Memory layout of GCHeapSummary
I'm not a C++ expert, so please take the following with a grain of salt. Basically, I extracted the relevant types into a small C++ program that finds the offsets of the fields.
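Below is a minimal sketch of what such a program can look like. The classes are my own stand-ins that mirror only the data members of the real OpenJDK types from gcWhen.hpp and gcHeapSummary.hpp (plus a virtual destructor to account for the vtable pointer the real classes carry), so treat it as an illustration rather than the exact code I ran.

// memory_layout.cpp (sketch) - compute field offsets of stand-in types.
#include <cstddef>
#include <cstdint>
#include <cstdio>

class GCWhen {
public:
    enum Type { BeforeGC, AfterGC, GCWhenEndSentinel };
};

// HotSpot's VirtualSpaceSummary holds three HeapWord* pointers.
struct VirtualSpaceSummary {
    void* _start;
    void* _committed_end;
    void* _reserved_end;
};

class GCHeapSummary {
public:
    // Members are public here only to make the offset computation easy.
    VirtualSpaceSummary _heap;
    size_t _used;
    // The real class has virtual methods, so a vtable pointer sits at offset 0
    // and pushes _heap to offset 8.
    virtual ~GCHeapSummary() = default;
};

class G1HeapSummary : public GCHeapSummary {
public:
    size_t _edenUsed;
    size_t _edenCapacity;
    size_t _survivorUsed;
    size_t _oldGenUsed;
    unsigned _numberOfRegions;
};

// Offset of a field relative to the start of the object it belongs to.
template <typename T, typename F>
size_t offset_of(const T& obj, const F& field) {
    return (size_t)(reinterpret_cast<const char*>(&field) - reinterpret_cast<const char*>(&obj));
}

int main() {
    G1HeapSummary g1{};
    const GCHeapSummary& base = g1;
    printf("sizeof(Type): %zu bytes\n", sizeof(GCWhen::Type));
    printf("GCHeapSummary::_heap: %zu bytes\n", offset_of(base, base._heap));
    printf("GCHeapSummary::_used: %zu bytes\n", offset_of(base, base._used));
    printf("G1HeapSummary::_edenUsed: %zu bytes\n", offset_of(g1, g1._edenUsed));
    printf("G1HeapSummary::_edenCapacity: %zu bytes\n", offset_of(g1, g1._edenCapacity));
    printf("G1HeapSummary::_survivorUsed: %zu bytes\n", offset_of(g1, g1._survivorUsed));
    printf("G1HeapSummary::_oldGenUsed: %zu bytes\n", offset_of(g1, g1._oldGenUsed));
    printf("G1HeapSummary::_numberOfRegions: %zu bytes\n", offset_of(g1, g1._numberOfRegions));
    return 0;
}

My original program produces the following results on x86-64, and gcc and clang agree: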
gcc
➜ /tmp uname -a
Linux home-lab 6.11.0-13-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Sat Nov 30 23:51:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
➜ /tmp g++ --version
g++ (Ubuntu 14.2.0-4ubuntu2) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
➜ /tmp g++ memory_layout.cpp -o memory_layout
➜ /tmp ./memory_layout
sizeof(Type): 4 bytes
Type::BeforeGC is 0
Type::AfterGC is 1
Type::GCWhenEndSentinel is 2
The offsets of the fields of GCHeapSummary class
_heap: 8 bytes
_used: 32 bytes
The offsets of the fields of G1HeapSummary class
_heap: 8 bytes
_used: 32 bytes
_edenUsed: 40 bytes
_edenCapacity: 48 bytes
_survivorUsed: 56 bytes
_oldGenUsed: 64 bytes
_numberOfRegions: 72 bytes
clang
➜ /tmp uname -a
Linux home-lab 6.11.0-13-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Sat Nov 30 23:51:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
➜ /tmp clang++ --version
Ubuntu clang version 19.1.1 (1ubuntu1)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm-19/bin
➜ /tmp clang++ memory_layout.cpp -o memory_layout
➜ /tmp ./memory_layout
sizeof(Type): 4 bytes
Type::BeforeGC is 0
Type::AfterGC is 1
Type::GCWhenEndSentinel is 2
The offsets of the fields of GCHeapSummary class
_heap: 8 bytes
_used: 32 bytes
The offsets of the fields of G1HeapSummary class
_heap: 8 bytes
_used: 32 bytes
_edenUsed: 40 bytes
_edenCapacity: 48 bytes
_survivorUsed: 56 bytes
_oldGenUsed: 64 bytes
_numberOfRegions: 72 bytes
We can see that the _used field starts at offset 32. Let's try to put a uprobe on report_gc_heap_summary and read the data as uint64 starting from that offset:
sudo bpftrace -p 69320 \
-e 'uprobe:/home/user/.sdkman/candidates/java/21.0.5-zulu/lib/server/libjvm.so:_ZNK8GCTracer22report_gc_heap_summaryEN6GCWhen4TypeERK13GCHeapSummary
{
// void report_gc_heap_summary(GCWhen::Type when, const GCHeapSummary& heap_summary) const;
// The first arg is pointer to class, we skip it.
// The second argument, `arg1` indicates whether it is before (0) or after GC (1)
// The third argument, `arg2` points to GCHeapSummary. We use 32 bytes offset to read `_used` field
$offset = (uint64)32;
$value = *(uint64 *)(arg2 + $offset);
printf("command: %s, when: %d, used: %d\n", comm, arg1, $value);
}'
OK, we managed to hook into the USDTs, a tracepoint and a uprobe. However, it was all done using the bpftrace tool. It is now time to define at a high level what we want to achieve and to convert these small scripts into a Rust application, with a small amount of C code, that uses libbpf-rs to work with eBPF.
Building the tool
Below is a diagram of the system using the C4 Model, a so-called container diagram. The repository with the code is located at REASY/jheapusage
Let's zoom into jheapusage, the core component that deploys and manages the eBPF programs.
jheapusage
It is a Rust application that relies on libbpf-rs as the tooling for eBPF. At a high level it does the following:
Recognizes whether the target Java app runs in a container; it can handle namespaces
Checks whether the target app is actually a Java app (the check is quite simple: it tries to find the hsperfdata folder, the same way as jps does in local mode)
Estimates the system's boot time as a Unix timestamp with nanosecond precision. This is needed because the Linux kernel only provides the bpf_ktime_get_ns helper, which returns the time elapsed since system boot!
Finds the report_gc_heap_summary symbol in libjvm.so to have the exact name for the uprobe
Initializes eBPF and attaches to the USDTs, tracepoint and uprobe. During initialization it sets the following fields for the eBPF program:
st_dev
st_ino
target_userspace_pid
boot_time_ns
sampling_interval_ns
Initializes OpenTelemetry for metrics
Creates ring buffers to receive GC events from the kernel and connects them with event handlers. The event handlers push each received event as a metric to OTLP.
Keeps polling the ring buffers while the target app is still running
Simplified C4 component diagram for jheapusage
The Rust CLI application has the following code structure:
src/main.rs - the program entry point
src/ebpf.rs - handles loading of eBPF and attaching to different hooks (USDTs, uprobes, tracepoints)
src/ebpf/jvm.bpf.c - eBPF programs written in C with logic defined for the hooks
src/ebpf/jvm.h - common C types and methods used by src/ebpf/jvm.bpf.c
src/ebpf/jvm.skel.rs - auto-generated Rust binding for C eBPF. Generated by build.rs, relies on libbpf-cargo
src/handlers.rs - receives RingBuffer data and pushes them to queue for async processing later
src/events.rs - defines how events are formatted as text and marks events as implementing plain::Plain to simplify C struct to Rust struct conversion
src/isolation.rs - executes a provided Rust function in the namespace of the target PID. This is needed to get the username inside the namespace to be able to find the correct hsperfdata path
src/logger.rs - configures and manages tracing logs
src/otlp.rs - exports events to Open Telemetry OTLP endpoint
src/utils.rs - shared utility functions
src/errors.rs - central error type definition
My Eureka Moments
One can understand whether a process runs in a container/namespace isolation by reading the /proc/{pid}/status file. Isn't it cool and so simple at the same time? :)
a. No isolation
b. Running in a container
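Here is a small sketch of one way to do such a check from code (my illustration, not necessarily the exact check jheapusage performs): the NSpid field of /proc/{pid}/status lists the pid in every PID namespace between ours and the target's, so more than one value means the target runs in a nested PID namespace.

// ns_check.cpp (sketch) - use the NSpid field of /proc/<pid>/status to see
// whether a process lives in a nested PID namespace (e.g. a container).
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// NSpid lists the pid in every PID namespace from ours down to the target's,
// so more than one value means the target runs in a child PID namespace.
std::vector<long> ns_pids(long pid) {
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("NSpid:", 0) == 0) {
            std::istringstream fields(line.substr(6));
            std::vector<long> pids;
            long value;
            while (fields >> value) pids.push_back(value);
            return pids;
        }
    }
    return {};
}

int main(int argc, char** argv) {
    long pid = argc > 1 ? std::stol(argv[1]) : 1;
    auto pids = ns_pids(pid);
    std::cout << "pid " << pid << (pids.size() > 1 ? " runs" : " does not run")
              << " in a nested PID namespace\n";
    return 0;
}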
To enter a namespace one can use the setns syscall; you just need to provide a file descriptor and the namespace type. nsenter from util-linux uses it
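A minimal sketch of the idea (this illustrates the syscall itself, not jheapusage's actual code, which does this from Rust): open a namespace file from /proc/{pid}/ns and pass the descriptor to setns, using the mount namespace as an example.

// enter_ns.cpp (sketch) - join the mount namespace of a target pid via setns(2).
// Needs the right privileges (typically root / CAP_SYS_ADMIN).
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    std::string path = "/proc/" + std::string(argv[1]) + "/ns/mnt";
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    // The second argument restricts the namespace type; CLONE_NEWNS = mount namespace.
    if (setns(fd, CLONE_NEWNS) != 0) { perror("setns"); return 1; }
    close(fd);
    // From here on, filesystem paths resolve inside the target's mount namespace.
    execlp("ls", "ls", "/tmp", (char*)nullptr);
    perror("execlp");
    return 1;
}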
The eBPF subsystem does not provide an API to get the current time as a Unix timestamp. However, it provides bpf_ktime_get_ns, which returns the time elapsed since system boot, in nanoseconds. We can estimate the system boot time as a Unix timestamp in nanoseconds by estimating the relationship between CLOCK_REALTIME and CLOCK_BOOTTIME using clock_gettime; the implementation is in estimate_system_boot_time
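A sketch of the same estimation in a few lines (the real estimate_system_boot_time lives in the Rust code; the names here are mine): read both clocks back to back and subtract.

// boot_time.cpp (sketch) - estimate system boot time as a Unix timestamp in ns:
// boot_time ~= CLOCK_REALTIME - CLOCK_BOOTTIME, read as close together as possible.
#include <cstdint>
#include <cstdio>
#include <ctime>

static uint64_t now_ns(clockid_t clock) {
    timespec ts{};
    clock_gettime(clock, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main() {
    uint64_t realtime = now_ns(CLOCK_REALTIME); // wall clock, ns since the Unix epoch
    uint64_t boottime = now_ns(CLOCK_BOOTTIME); // ns since boot, including suspend
    uint64_t boot_time_ns = realtime - boottime; // estimated boot time, Unix ns
    printf("estimated boot time: %llu ns since epoch\n", (unsigned long long)boot_time_ns);
    return 0;
}

The result can then be added to the kernel-side timestamps to obtain Unix timestamps.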
bpf_get_current_pid_tgid returns a u64 containing the current tgid and pid: the upper 32 bits hold the tgid, which is what user space calls the process id, and the lower 32 bits hold the pid, which is what user space calls the thread id. I wrote a helper function extract_userspace_ids to not get confused
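The actual helper is eBPF C in the repository; the snippet below only illustrates the bit layout of the returned value (the example numbers are made up).

// pid_tgid_layout.cpp (sketch) - bpf_get_current_pid_tgid() packs two ids into
// one u64: (tgid << 32) | pid. In user-space terms: (process id << 32) | thread id.
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t pid_tgid = ((uint64_t)4242 << 32) | 4250; // example value
    uint32_t process_id = (uint32_t)(pid_tgid >> 32);  // tgid: what ps shows as PID
    uint32_t thread_id  = (uint32_t)pid_tgid;          // pid: the kernel task (thread) id
    printf("process id: %u, thread id: %u\n", process_id, thread_id);
    return 0;
}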
To embed the skeleton object (libbpf_rs::skel::Skel) in a Rust struct, I had to do some unsafe Rust type-system dancing
OTLP + Grafana + Prometheus stack
I used opentelemetry-demo/docker-compose.yml as a base version and modified it to have only what we need. You can find the details in the jheapusage/otlp folder. To start the stack, open a shell in that folder and run docker compose up -d
Running example
I created this small video to show the usage of this app.
Notes
OpenJDK builds have DTrace disabled, see JDK-8255612: Explicitly disable dtrace for Oracle OpenJDK Linux builds #946, so I use Azul Zulu Builds of OpenJDK Java 21 (I use SDKMAN! to take care of different versions of Java on my machine). We can confirm that Zulu OpenJDK has DTrace probes enabled by running
tplist-bpfcc -l #PATH_TO_LIBJVM.SO#
to verify the existence of those two probes in libjvm.so.
At the time of writing I used bpftrace v0.21.2;
Because we hook into GC collection events, the approach will only work if there are GC collections; otherwise we won't receive any events and it will look like the Java heap isn't changing at all (push-based approach). This isn't the case when a pull-based approach is used, as in JVM agents that periodically read memory usage and report it (the OpenTelemetry Java Agent uses MemoryPoolMXBean::getUsage, VisualVM uses MemoryMXBean::getHeapMemoryUsage);
I used NotebookLM to create a voice over for this article.