
Conversation

@squadgazzz
Contributor

@squadgazzz squadgazzz commented Nov 25, 2025

Background

Production memory issues are difficult to debug without heap profiling. We need on-demand memory dumps that can be collected from running services without restarts or separate profiling builds.

Description

Adds optional jemalloc heap profiling behind a jemalloc-profiling feature flag. When enabled, services expose a Unix socket that accepts dump requests and streams pprof-compatible heap profiles back.

Changes

  • Add jemalloc-profiling feature flag to all 6 services
  • Integrate jemalloc_pprof and tikv-jemallocator
  • New heap_dump_handler.rs module exposing a Unix socket at /tmp/heap_dump_<service>.sock (sketched below)
  • Update Dockerfile to install netcat-openbsd for socket communication (also useful for the existing logs handler)
  • Add CI workflow .github/workflows/deploy-profiling.yaml for building profiling images

Services affected: orderbook, autopilot, driver, and solvers. alerter and refunder are also touched, mostly to simplify the Dockerfile.
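
For illustration, a minimal sketch of the handler's request loop, assuming a tokio UnixListener and the jemalloc_pprof::PROF_CTL API quoted later in this conversation; the actual module also reads the "dump" command sent by the client and enforces a timeout:

```rust
use tokio::{io::AsyncWriteExt, net::UnixListener};

// Illustrative only: in this sketch any connection triggers a dump, while
// the real handler first reads the "dump" command from the client.
async fn serve_heap_dumps(socket_path: &str) -> std::io::Result<()> {
    let listener = UnixListener::bind(socket_path)?;
    loop {
        let (mut stream, _) = listener.accept().await?;
        // PROF_CTL is None when jemalloc profiling was not enabled at startup.
        let Some(prof_ctl) = jemalloc_pprof::PROF_CTL.as_ref() else {
            continue;
        };
        match prof_ctl.lock().await.dump_pprof() {
            // Stream the pprof-encoded profile back to the connected client.
            Ok(pprof_data) => {
                let _ = stream.write_all(&pprof_data).await;
            }
            Err(err) => tracing::error!(?err, "heap dump failed"),
        }
    }
}
```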

Usage

Build with profiling:

  • Locally:
cargo build --features jemalloc-profiling -p orderbook
export MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:19"
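# prof:true enables jemalloc's profiler, prof_active:true starts sampling
# immediately, and lg_prof_sample:19 samples on average every 2^19 bytes (512 KiB)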
# then, run the orderbook binary as usual
  • Using the new deploy-profiling CI job.

Collect heap dump from Kubernetes:

kubectl exec <pod_name> -n <namespace> -- sh -c "echo dump | nc -U /tmp/heap_dump_<binary_name>.sock" > heap.pprof
go tool pprof -http=:8080 heap.pprof

How to test

Already tested on sepolia-staging; it requires bumping the memory limits a bit. Testing will continue in the shadow env.

@squadgazzz squadgazzz force-pushed the jemalloc-profiling-support branch from 4d0fd64 to 82bdce5 on November 25, 2025 20:13
@squadgazzz squadgazzz marked this pull request as ready for review November 27, 2025 20:14
@squadgazzz squadgazzz requested a review from a team as a code owner November 27, 2025 20:14
Copilot AI review requested due to automatic review settings November 27, 2025 20:14
Contributor

Copilot AI left a comment


Pull request overview

This PR adds optional jemalloc heap profiling capabilities to production services for debugging memory issues without requiring restarts or separate profiling builds. The feature is controlled by a jemalloc-profiling feature flag and exposes Unix sockets for on-demand heap dump collection.

Key changes:

  • New heap_dump_handler module that spawns a Unix socket server for heap dump requests
  • Feature flag integration across 6 services with conditional jemalloc allocator
  • Docker and CI infrastructure updates to support profiling builds

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 10 comments.

Summary per file:

  • crates/observe/src/heap_dump_handler.rs: New module implementing a Unix socket server for heap dump collection via jemalloc_pprof
  • crates/observe/src/lib.rs: Conditionally exports the heap_dump_handler module under the jemalloc-profiling feature
  • crates/observe/Cargo.toml: Adds the jemalloc_pprof dependency and the jemalloc-profiling feature flag
  • crates/*/src/main.rs: Conditionally switches the global allocator from mimalloc to jemalloc based on the feature flag
  • crates/*/src/run.rs: Spawns the heap dump handler during service initialization
  • crates/*/Cargo.toml: Adds the jemalloc-profiling feature flag with the required dependencies
  • Cargo.toml: Adds workspace-level jemalloc dependencies and updates the tempfile version
  • Cargo.lock: Lock-file updates for the new dependencies
  • Dockerfile: Adds build arguments for feature flags and installs netcat-openbsd in the final image
  • .github/workflows/deploy-profiling.yaml: New CI workflow for building and deploying profiling-enabled images


@squadgazzz squadgazzz marked this pull request as draft November 27, 2025 20:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.



@squadgazzz squadgazzz marked this pull request as ready for review November 28, 2025 11:02
Contributor

@MartinquaXD MartinquaXD left a comment


Looks like this should be very helpful!
Did you already confirm that the generated dumps are helpful? I remember you generated some in the past but I don't recall if they also used jemalloc. Can this reasonably run in prod without causing significant overhead or is this still problematic?
If overhead is a concern I think we should be able to use the socket to start profiling on demand as well (instead of always profiling and only dumping on demand).

Comment on lines +21 to +23
// Check if jemalloc profiling is available before spawning the handler
// This prevents panics that would crash the entire process
let profiling_available =
Contributor


Probably best to add #[cfg(feature = "jemalloc-profiling")] to the function. You technically still need this check in case we somehow mess up and don't use the jemalloc allocator despite enabling the feature, but it better expresses the requirements for this function to work, IMO.

Also, why does this check have to happen inside catch_unwind? Seems like this should be doable without panicking.

Contributor Author


During my tests, jemalloc_pprof::PROF_CTL panicked when a wrong config was provided, since the lib internally uses unsafe operations:

        let prof_enabled: bool = unsafe { raw::read(b"opt.prof\0") }.unwrap();

Contributor Author


Gated the function with the feature flag.

let profiling_available =
std::panic::catch_unwind(|| jemalloc_pprof::PROF_CTL.as_ref().is_some()).unwrap_or(false);

if !profiling_available {
Contributor


Unless I misunderstand something, we'll never hit this case with the current code, no?

Contributor Author


If jemalloc_pprof::PROF_CTL panics, we return false and then just print a log.
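
Putting the thread together, the guard ends up shaped roughly like this (function name illustrative; the catch_unwind stays because PROF_CTL's lazy initialization unwraps unsafe mallctl reads and panics on a bad or missing MALLOC_CONF):

```rust
#[cfg(feature = "jemalloc-profiling")]
pub fn spawn_heap_dump_handler() {
    // catch_unwind keeps a misconfigured allocator from crashing the whole
    // process; see the panic observed in testing above.
    let profiling_available =
        std::panic::catch_unwind(|| jemalloc_pprof::PROF_CTL.as_ref().is_some())
            .unwrap_or(false);
    if !profiling_available {
        tracing::warn!("jemalloc profiling unavailable; heap dump handler not started");
        return;
    }
    // ... bind the Unix socket and serve dump requests ...
}
```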

}
}

fn binary_name() -> Option<String> {
Contributor


Looks like some of the code was copied and pasted from the log filter socket stuff. Given that the new code also lives inside observe, we should be able to reuse some of it, no?

Contributor Author


Yes, we can. I wanted to do the refactoring in a separate PR.
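
For context, one plausible shape of this helper (assumed, not the exact implementation):

```rust
fn binary_name() -> Option<String> {
    // Derive the socket name suffix from the running executable,
    // e.g. "orderbook" yields /tmp/heap_dump_orderbook.sock.
    std::env::current_exe()
        .ok()?
        .file_name()?
        .to_str()
        .map(str::to_owned)
}
```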


match prof_ctl.dump_pprof() {
Ok(pprof_data) => {
tracing::info!(size_bytes = pprof_data.len(), "heap dump generated");
Contributor


I think at this point we can drop the lock on prof_ctl.
It would also be nice to know how long we held the lock here.
Also, what does holding this lock mean for the rest of the process? Are we unable to make new allocations while it's held?

Contributor Author


Makes sense, done
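
A sketch of the adjusted flow (names illustrative; JemallocProfCtl assumed to be the type behind PROF_CTL):

```rust
use std::{sync::Arc, time::Instant};
use tokio::sync::Mutex;

// Dump under the lock, log how long it was held, release before any I/O.
async fn dump_with_timing(
    prof_ctl: Arc<Mutex<jemalloc_pprof::JemallocProfCtl>>,
) -> Option<Vec<u8>> {
    let started = Instant::now();
    let result = {
        let mut ctl = prof_ctl.lock().await;
        ctl.dump_pprof()
    }; // the guard is dropped here, before streaming to the socket
    tracing::info!(lock_held = ?started.elapsed(), "profiling lock released");
    match result {
        Ok(pprof_data) => {
            tracing::info!(size_bytes = pprof_data.len(), "heap dump generated");
            Some(pprof_data)
        }
        Err(err) => {
            tracing::error!(?err, "heap dump failed");
            None
        }
    }
}
```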

}
Err(_) => {
handle.abort();
tracing::error!("heap dump request timed out after 5 minutes");
Contributor


Suggested change
tracing::error!("heap dump request timed out after 5 minutes");
tracing::error!(?time, "heap dump request timed out");

Probably nice to store the timeout in a variable and reference it here, so we don't run into the case where someone changes the timeout but not the log.
Also, how long is this process expected to run? 5 minutes seems like a very long time.

Contributor Author


Also how long is this process expected to run? 5 minutes seems like a very long time.

I'm not sure. Reduced it to 1 minute, since the report should be ready within a few seconds.
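
The suggested pattern, sketched (constant name illustrative):

```rust
use std::time::Duration;

// Single source of truth: the log below references the same constant,
// so the message can never drift from the actual timeout.
const HEAP_DUMP_TIMEOUT: Duration = Duration::from_secs(60);

async fn dump_with_timeout(
    mut dump_task: tokio::task::JoinHandle<Vec<u8>>,
) -> Option<Vec<u8>> {
    match tokio::time::timeout(HEAP_DUMP_TIMEOUT, &mut dump_task).await {
        Ok(joined) => joined.ok(),
        Err(_elapsed) => {
            // Abort the still-running dump and log the shared constant.
            dump_task.abort();
            tracing::error!(timeout = ?HEAP_DUMP_TIMEOUT, "heap dump request timed out");
            None
        }
    }
}
```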

WORKDIR /src/

# Accept build arguments for enabling features
ARG CARGO_BUILD_FEATURES=""
Contributor


Where does this value actually get populated? It seems like we always set it to "".

Contributor Author


Not always. The CI job uses it to build an image with the profiler enabled:

CARGO_BUILD_FEATURES=--features jemalloc-profiling

Contributor Author


Basically, the CI's --build-arg overrides the Dockerfile's default.
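
Illustratively (image tags assumed, not copied from the workflow):

```bash
# Default image: CARGO_BUILD_FEATURES keeps its "" default.
docker build -t services .

# Profiling image, as built by the deploy-profiling job:
docker build \
  --build-arg CARGO_BUILD_FEATURES="--features jemalloc-profiling" \
  -t services-profiling .
```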

Comment on lines +12 to +19
/// Usage:
/// ```bash
/// # From your local machine (one-liner):
/// kubectl exec <pod> -n <namespace> -- sh -c "echo dump | nc -U /tmp/heap_dump_<binary_name>.sock" > heap.pprof
///
/// # Analyze with pprof:
/// go tool pprof -http=:8080 heap.pprof
/// ```
Contributor


Can you please also document this in the README.md?

Contributor Author


Done

@squadgazzz
Contributor Author

Did you already confirm that the generated dumps are helpful? I remember you generated some in the past but I don't recall if they also used jemalloc. Can this reasonably run in prod without causing significant overhead or is this still problematic?
If overhead is a concern I think we should be able to use the socket to start profiling on demand as well (instead of always profiling and only dumping on demand).

This was already tested in prod a few months ago with no issues, but using a different version where a simple API endpoint was used instead of Unix sockets; the socket approach is considered more concise. I'll be testing it in shadow and staging for a few days, and once confident with that, I'll do the same in prod. Only jemalloc is currently supported.

@squadgazzz squadgazzz enabled auto-merge December 1, 2025 14:03
@squadgazzz squadgazzz added this pull request to the merge queue Dec 1, 2025
Merged via the queue into main with commit cb45ac8 Dec 1, 2025
17 of 18 checks passed
@squadgazzz squadgazzz deleted the jemalloc-profiling-support branch December 1, 2025 14:29
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2025