
Conversation

@squadgazzz
Contributor

@squadgazzz squadgazzz commented Nov 25, 2025

Background

Production memory issues are difficult to debug without heap profiling. We need on-demand memory dumps that can be collected from running services without restarts or separate profiling builds.

Description

Adds optional jemalloc heap profiling behind a jemalloc-profiling feature flag. When enabled, services expose a Unix socket that accepts dump requests and streams pprof-compatible heap profiles back.

Changes

  • Add jemalloc-profiling feature flag to all 6 services
  • Integrate jemalloc_pprof and tikv-jemallocator
  • New heap_dump_handler.rs module exposing a Unix socket at /tmp/heap_dump_<service>.sock (sketched below)
  • Update Dockerfile to install netcat-openbsd for socket communication (also useful for the existing logs handler)
  • Add CI workflow .github/workflows/deploy-profiling.yaml for building profiling images

Services affected: orderbook, autopilot, driver, and solvers. alerter and refunder are also touched, mostly to simplify the Dockerfile.
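
For illustration, a minimal sketch of the handler's request loop, assuming a tokio UnixListener and the jemalloc_pprof::PROF_CTL API quoted later in this conversation; the actual module also reads the "dump" command sent by the client and enforces a timeout:

```rust
use tokio::{io::AsyncWriteExt, net::UnixListener};

// Illustrative only: in this sketch any connection triggers a dump, while
// the real handler first reads the "dump" command from the client.
async fn serve_heap_dumps(socket_path: &str) -> std::io::Result<()> {
    let listener = UnixListener::bind(socket_path)?;
    loop {
        let (mut stream, _) = listener.accept().await?;
        // PROF_CTL is None when jemalloc profiling was not enabled at startup.
        let Some(prof_ctl) = jemalloc_pprof::PROF_CTL.as_ref() else {
            continue;
        };
        match prof_ctl.lock().await.dump_pprof() {
            // Stream the pprof-encoded profile back to the connected client.
            Ok(pprof_data) => {
                let _ = stream.write_all(&pprof_data).await;
            }
            Err(err) => tracing::error!(?err, "heap dump failed"),
        }
    }
}
```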

Usage

Build with profiling:

  • Locally:
cargo build --features jemalloc-profiling -p orderbook
export MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:19"
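# prof:true enables jemalloc's profiler, prof_active:true starts sampling
# immediately, and lg_prof_sample:19 samples on average every 2^19 bytes (512 KiB)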
# then, run the orderbook binary as usual
  • Using the new deploy-profiling CI job.

Collect heap dump from Kubernetes:

kubectl exec <pod_name> -n <namespace> -- sh -c "echo dump | nc -U /tmp/heap_dump_<binary_name>.sock" > heap.pprof
go tool pprof -http=:8080 heap.pprof

How to test

Already tested on sepolia-staging; it requires bumping the memory limits a bit. Testing will continue in the shadow env.

@squadgazzz squadgazzz force-pushed the jemalloc-profiling-support branch from 4d0fd64 to 82bdce5 on November 25, 2025 20:13
@squadgazzz squadgazzz marked this pull request as ready for review November 27, 2025 20:14
@squadgazzz squadgazzz requested a review from a team as a code owner November 27, 2025 20:14
Copilot AI review requested due to automatic review settings November 27, 2025 20:14
Contributor

Copilot AI left a comment


Pull request overview

This PR adds optional jemalloc heap profiling capabilities to production services for debugging memory issues without requiring restarts or separate profiling builds. The feature is controlled by a jemalloc-profiling feature flag and exposes Unix sockets for on-demand heap dump collection.

Key changes:

  • New heap_dump_handler module that spawns a Unix socket server for heap dump requests
  • Feature flag integration across 6 services with conditional jemalloc allocator
  • Docker and CI infrastructure updates to support profiling builds

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 10 comments.

Summary per file:

  • crates/observe/src/heap_dump_handler.rs: New module implementing a Unix socket server for heap dump collection via jemalloc_pprof
  • crates/observe/src/lib.rs: Conditionally exports the heap_dump_handler module under the jemalloc-profiling feature
  • crates/observe/Cargo.toml: Adds the jemalloc_pprof dependency and the jemalloc-profiling feature flag
  • crates/*/src/main.rs: Conditionally switches the global allocator from mimalloc to jemalloc based on the feature flag
  • crates/*/src/run.rs: Spawns the heap dump handler during service initialization
  • crates/*/Cargo.toml: Adds the jemalloc-profiling feature flag with the required dependencies
  • Cargo.toml: Adds workspace-level jemalloc dependencies and updates the tempfile version
  • Cargo.lock: Lock-file updates for the new dependencies
  • Dockerfile: Adds build arguments for feature flags and installs netcat-openbsd in the final image
  • .github/workflows/deploy-profiling.yaml: New CI workflow for building and deploying profiling-enabled images


@squadgazzz squadgazzz marked this pull request as draft November 27, 2025 20:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.



@squadgazzz squadgazzz marked this pull request as ready for review November 28, 2025 11:02
Contributor

@MartinquaXD MartinquaXD left a comment


Looks like this should be very helpful!
Did you already confirm that the generated dumps are helpful? I remember you generated some in the past but I don't recall if they also used jemalloc. Can this reasonably run in prod without causing significant overhead or is this still problematic?
If overhead is a concern I think we should be able to use the socket to start profiling on demand as well (instead of always profiling and only dumping on demand).

Comment on lines +21 to +23
// Check if jemalloc profiling is available before spawning the handler
// This prevents panics that would crash the entire process
let profiling_available =
Contributor


Probably best to add #[cfg(feature = "jemalloc-profiling")] to the function. You technically still need this check in case we somehow mess up and don't use the jemalloc allocator despite enabling the feature, but it better expresses the requirements for this function to work, IMO.

Also, why does this check have to happen inside catch_unwind? Seems like this should be doable without panicking.

Contributor Author


During my tests, jemalloc_pprof::PROF_CTL panicked when a wrong config was provided, since the lib internally uses unsafe operations:

        let prof_enabled: bool = unsafe { raw::read(b"opt.prof\0") }.unwrap();

Contributor Author


Gated the function with the feature flag.

let profiling_available =
std::panic::catch_unwind(|| jemalloc_pprof::PROF_CTL.as_ref().is_some()).unwrap_or(false);

if !profiling_available {
Contributor


Unless I misunderstand something, we'll never hit this case with the current code, no?

Contributor Author


If jemalloc_pprof::PROF_CTL panics, we return false and then just print a log.
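
Putting the thread together, the guard ends up shaped roughly like this (function name illustrative; the catch_unwind stays because PROF_CTL's lazy initialization unwraps unsafe mallctl reads and panics on a bad or missing MALLOC_CONF):

```rust
#[cfg(feature = "jemalloc-profiling")]
pub fn spawn_heap_dump_handler() {
    // catch_unwind keeps a misconfigured allocator from crashing the whole
    // process; see the panic observed in testing above.
    let profiling_available =
        std::panic::catch_unwind(|| jemalloc_pprof::PROF_CTL.as_ref().is_some())
            .unwrap_or(false);
    if !profiling_available {
        tracing::warn!("jemalloc profiling unavailable; heap dump handler not started");
        return;
    }
    // ... bind the Unix socket and serve dump requests ...
}
```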

}
}

fn binary_name() -> Option<String> {
Contributor


Looks like some of the code was copied and pasted from the log filter socket stuff. Given that the new code also lives inside observe, we should be able to reuse some of it, no?

Contributor Author


Yes, we can. I wanted to do the refactoring in a separate PR.
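
For context, one plausible shape of this helper (assumed, not the exact implementation):

```rust
fn binary_name() -> Option<String> {
    // Derive the socket name suffix from the running executable,
    // e.g. "orderbook" yields /tmp/heap_dump_orderbook.sock.
    std::env::current_exe()
        .ok()?
        .file_name()?
        .to_str()
        .map(str::to_owned)
}
```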


match prof_ctl.dump_pprof() {
Ok(pprof_data) => {
tracing::info!(size_bytes = pprof_data.len(), "heap dump generated");
Contributor


I think at this point we can drop the lock on prof_ctl.
It would also be nice to know how long we held the lock here.
Also, what does holding this lock mean for the rest of the process? Are we unable to make new allocations while it's held?

Contributor Author


Makes sense, done
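
A sketch of the adjusted flow (names illustrative; JemallocProfCtl assumed to be the type behind PROF_CTL):

```rust
use std::{sync::Arc, time::Instant};
use tokio::sync::Mutex;

// Dump under the lock, log how long it was held, release before any I/O.
async fn dump_with_timing(
    prof_ctl: Arc<Mutex<jemalloc_pprof::JemallocProfCtl>>,
) -> Option<Vec<u8>> {
    let started = Instant::now();
    let result = {
        let mut ctl = prof_ctl.lock().await;
        ctl.dump_pprof()
    }; // the guard is dropped here, before streaming to the socket
    tracing::info!(lock_held = ?started.elapsed(), "profiling lock released");
    match result {
        Ok(pprof_data) => {
            tracing::info!(size_bytes = pprof_data.len(), "heap dump generated");
            Some(pprof_data)
        }
        Err(err) => {
            tracing::error!(?err, "heap dump failed");
            None
        }
    }
}
```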

}
Err(_) => {
handle.abort();
tracing::error!("heap dump request timed out after 5 minutes");
Contributor


Suggested change
tracing::error!("heap dump request timed out after 5 minutes");
tracing::error!(?time, "heap dump request timed out");

Probably nice to store the timeout in a variable and reference it here, so we don't run into the case where someone changes the timeout but not the log.
Also, how long is this process expected to run? 5 minutes seems like a very long time.

Contributor Author


Also how long is this process expected to run? 5 minutes seems like a very long time.

I'm not sure. Reduced it to 1 minute, since the report should be ready within a few seconds.
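
The suggested pattern, sketched (constant name illustrative):

```rust
use std::time::Duration;

// Single source of truth: the log below references the same constant,
// so the message can never drift from the actual timeout.
const HEAP_DUMP_TIMEOUT: Duration = Duration::from_secs(60);

async fn dump_with_timeout(
    mut dump_task: tokio::task::JoinHandle<Vec<u8>>,
) -> Option<Vec<u8>> {
    match tokio::time::timeout(HEAP_DUMP_TIMEOUT, &mut dump_task).await {
        Ok(joined) => joined.ok(),
        Err(_elapsed) => {
            // Abort the still-running dump and log the shared constant.
            dump_task.abort();
            tracing::error!(timeout = ?HEAP_DUMP_TIMEOUT, "heap dump request timed out");
            None
        }
    }
}
```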

WORKDIR /src/

# Accept build arguments for enabling features
ARG CARGO_BUILD_FEATURES=""
Contributor


Where does this value actually get populated? It seems like we always set it to "".

Contributor Author


Not always. The CI job uses it to build an image with the profiler enabled:

CARGO_BUILD_FEATURES=--features jemalloc-profiling

Contributor Author


Basically, the CI's --build-arg overrides the Dockerfile's default.
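
Illustratively (image tags assumed, not copied from the workflow):

```bash
# Default image: CARGO_BUILD_FEATURES keeps its "" default.
docker build -t services .

# Profiling image, as built by the deploy-profiling job:
docker build \
  --build-arg CARGO_BUILD_FEATURES="--features jemalloc-profiling" \
  -t services-profiling .
```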

Comment on lines +12 to +19
/// Usage:
/// ```bash
/// # From your local machine (one-liner):
/// kubectl exec <pod> -n <namespace> -- sh -c "echo dump | nc -U /tmp/heap_dump_<binary_name>.sock" > heap.pprof
///
/// # Analyze with pprof:
/// go tool pprof -http=:8080 heap.pprof
/// ```
Contributor


Can you please also document this in the README.md?

Contributor Author


Done

@squadgazzz
Contributor Author

Did you already confirm that the generated dumps are helpful? I remember you generated some in the past but I don't recall if they also used jemalloc. Can this reasonably run in prod without causing significant overhead or is this still problematic?
If overhead is a concern I think we should be able to use the socket to start profiling on demand as well (instead of always profiling and only dumping on demand).

This was already tested in prod a few months ago with no issues, but using a different version where a simple API endpoint was used instead of Unix sockets; the socket approach is considered more concise. I'll be testing it in shadow and staging for a few days, and once confident with that, I'll do the same in prod. Only jemalloc is currently supported.

@squadgazzz squadgazzz enabled auto-merge December 1, 2025 14:03
@squadgazzz squadgazzz added this pull request to the merge queue Dec 1, 2025
Merged via the queue into main with commit cb45ac8 Dec 1, 2025
17 of 18 checks passed
@squadgazzz squadgazzz deleted the jemalloc-profiling-support branch December 1, 2025 14:29
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2025