feat: trace infrastructure — spans + request ID propagation [DIS-1643]#7733
Conversation
WalkthroughRequest-id resolution was made fallible: handlers now validate Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/http/service/anthropic.rs`:
- Line 126: The call to get_or_create_request_id currently maps errors directly
to into_response(), which returns the OpenAI-shaped (StatusCode,
Json<ErrorMessage>) body; instead catch the Err from get_or_create_request_id
and map it through anthropic_error (or the shared helper that returns
Anthropic-style status+message) before returning so invalid x-dynamo-request-id
yields an Anthropic-formatted 400; update the expression around
get_or_create_request_id(&headers).map_err(...) to call anthropic_error(...) (or
the helper) rather than into_response(), referencing get_or_create_request_id,
anthropic_error, and into_response to find the callsite.
In `@lib/llm/src/http/service/openai.rs`:
- Around line 292-317: The function get_or_create_request_id currently validates
the header but discards it unless get_distributed_tracing_context() supplies an
ID, and to_str() failures silently fall through; change the logic so that if
headers.get(DYNAMO_REQUEST_ID_HEADER) is present you call raw.to_str() and
return a BAD_REQUEST error on to_str() failure, then parse the string with
uuid::Uuid::parse_str and return BAD_REQUEST if parsing fails, otherwise keep
that validated header value as a candidate; then return
trace_context.x_dynamo_request_id if present, else the validated header value if
present, else generate a new UUID. Make these changes in
get_or_create_request_id and reference DYNAMO_REQUEST_ID_HEADER and
get_distributed_tracing_context accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fee1cb8a-7e67-4468-b61d-adfb0b4eb4ca
📒 Files selected for processing (4)
lib/llm/src/http/service/anthropic.rslib/llm/src/http/service/openai.rslib/runtime/src/logging.rslib/runtime/src/pipeline/network/ingress/push_endpoint.rs
9620fc0 to
1667513
Compare
1667513 to
361fd29
Compare
cd9d3df to
d2693ff
Compare
d2693ff to
3f5d48b
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/llm/src/http/service/openai.rs (1)
299-300: Update stale span helper name in comments.Line 299 and Line 324 still reference
make_request_span(), but the codebase now usesmake_inference_request_span(). This can mislead future maintainers.Also applies to: 324-324
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/openai.rs` around lines 299 - 300, Update the outdated helper name in the comments: replace references to make_request_span() with make_inference_request_span() in the comment blocks around the OpenAI HTTP service (e.g., the comment that starts "The request ID comes from the trace context..." and the later reference at the other occurrence). Ensure both comments now mention make_inference_request_span() so they match the actual helper function name used in the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 307-316: The request-ID validation error messages returned from
the UUID and UTF-8 checks should be prefixed with VALIDATION_PREFIX so they are
classified as validation (400) errors by classify_error_for_metrics; update the
error returns in the OpenAI request-ID parsing branch (the arms that currently
return "{} header must be a valid UTF-8 string" and "{} header must be a valid
UUID, got: {}") to prepend VALIDATION_PREFIX (the same constant used elsewhere)
to the message so downstream ErrorMessage handling and
classify_error_for_metrics treat them as validation errors.
---
Nitpick comments:
In `@lib/llm/src/http/service/openai.rs`:
- Around line 299-300: Update the outdated helper name in the comments: replace
references to make_request_span() with make_inference_request_span() in the
comment blocks around the OpenAI HTTP service (e.g., the comment that starts
"The request ID comes from the trace context..." and the later reference at the
other occurrence). Ensure both comments now mention
make_inference_request_span() so they match the actual helper function name used
in the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 39c4d1ed-4a42-4770-96cb-cc0da885c26c
📒 Files selected for processing (6)
lib/llm/src/http/service/anthropic.rslib/llm/src/http/service/openai.rslib/llm/src/http/service/service_v2.rslib/runtime/src/logging.rslib/runtime/src/pipeline/network/ingress/push_endpoint.rslib/runtime/src/system_status_server.rs
✅ Files skipped from review due to trivial changes (1)
- lib/llm/src/http/service/service_v2.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/runtime/src/pipeline/network/ingress/push_endpoint.rs
Implementation Plan — DIS-1643: Consistent Error TracingOverview4 stacked PRs implementing consistent structured logging across all frontend error paths and worker trace propagation. PR1: Trace Infrastructure (#7733)Problem: Request spans use Solution:
Files: PR2: Request Lifecycle Logging (#7734)Problem: No consistent "request received" / "request completed" log. Error paths have different log formats. Worker has no request-level logs at INFO. Solution:
Files: PR3: Token Counts, TTFT, ITL, Worker IDs (#7735)Problem: No token counts, latency metrics, or worker identification on request completion logs. Solution:
Performance: All additions are on cleanup path (Drop), not streaming hot path. Two f64/u64 accumulations per chunk (negligible alongside existing histogram publish). Files: PR4: E2E Tests (#7766)11 parallel-safe pytest tests (25s with
Request ID Propagation Flow"request completed" Fields (streaming){
"level": "INFO",
"message": "request completed",
"status": "success",
"request_id": "...",
"model": "qwen/qwen3-0.6b",
"endpoint": "chat_completions",
"request_type": "stream",
"elapsed_ms": "20",
"input_tokens": "9",
"output_tokens": "50",
"ttft_ms": "5.85",
"avg_itl_ms": "0.29",
"trace_id": "..."
}Follow-up Issues |
3f5d48b to
3ea53e1
Compare
3ea53e1 to
386137b
Compare
- Rename make_request_span → make_inference_request_span with target: "request_span" (always on via filter directive) - Add make_system_request_span with target: "system_span" (debug level) - Add "request_span=trace" directive in filters() - Simplify get_or_create_request_id() — validates UUID, returns Result<String, String> - Update worker spans to target: "request_span" - Worker system_status_server uses make_system_request_span Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
386137b to
67c6f13
Compare
|
Start frontend like so: Start mock worker like so: Send a chat completions request Frontend logs for request: {
"time": "2026-04-02T01:14:11.069507Z",
"level": "DEBUG",
"file": "/Users/tmontfort/Dynamo/repos/dynamo/lib/runtime/src/pipeline/network/egress/push_router.rs",
"line": 552,
"target": "dynamo_runtime::pipeline::network::egress::push_router",
"message": "Using TCP transport for instance",
"instance_id": 7587893895631993984,
"method": "POST",
"span_id": "cafd5bb2662ad1aa",
"span_name": "http-request",
"tcp_endpoint": "192.168.1.164:56530/694d9d4bb27ffc80/generate",
"trace_id": "646a97a0ca50f33dfe7269a341ec53a8",
"uri": "/v1/chat/completions",
"version": "HTTP/1.1",
"x_dynamo_request_id": "0f4ac56e-5bd1-4657-bf8b-39cdcde9036b"
}
{
"time": "2026-04-02T01:14:11.069606Z",
"level": "DEBUG",
"file": "/Users/tmontfort/Dynamo/repos/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs",
"line": 243,
"target": "dynamo_runtime::pipeline::network::tcp::server",
"message": "Registering new TcpStream on 192.168.1.164:56573",
"parent_id": "cafd5bb2662ad1aa",
"request_id": "0f4ac56e-5bd1-4657-bf8b-39cdcde9036b",
"router_mode": "RoundRobin",
"span_id": "6e09efb48634df43",
"span_name": "router.route_request",
"trace_id": "646a97a0ca50f33dfe7269a341ec53a8",
"worker_id": "7587893895631993984"
}
{
"time": "2026-04-02T01:14:11.069738Z",
"level": "DEBUG",
"file": "/Users/tmontfort/Dynamo/repos/dynamo/lib/runtime/src/pipeline/network/egress/tcp_client.rs",
"line": 510,
"target": "dynamo_runtime::pipeline::network::egress::tcp_client",
"message": "TCP client sending request to address: 192.168.1.164:56530/694d9d4bb27ffc80/generate",
"parent_id": "cafd5bb2662ad1aa",
"request_id": "0f4ac56e-5bd1-4657-bf8b-39cdcde9036b",
"router_mode": "RoundRobin",
"span_id": "6e09efb48634df43",
"span_name": "router.route_request",
"trace_id": "646a97a0ca50f33dfe7269a341ec53a8",
"worker_id": "7587893895631993984"
}
{
"time": "2026-04-02T01:14:11.075612Z",
"level": "DEBUG",
"file": "/Users/tmontfort/Dynamo/repos/dynamo/lib/llm/src/http/service/service_v2.rs",
"line": 548,
"target": "dynamo_llm::http::service::service_v2",
"message": "request completed",
"latency_ms": "8",
"method": "POST",
"span_id": "cafd5bb2662ad1aa",
"span_name": "http-request",
"status": "200",
"trace_id": "646a97a0ca50f33dfe7269a341ec53a8",
"uri": "/v1/chat/completions",
"version": "HTTP/1.1",
"x_dynamo_request_id": "0f4ac56e-5bd1-4657-bf8b-39cdcde9036b"
}We see both |
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ai-dynamo#7733) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
x_dynamo_request_idUUID inmake_inference_request_span()when client doesn't provide one — ensures ID exists from first log and propagates to workersx-dynamo-request-idheader is valid UUID (returns 400 if invalid)request_span(inference, always on) vssystem_span(health/metrics, debug level)request_span=tracefilter directive so inference spans survive anyDYN_LOGsettingget_or_create_request_id()to returnResult<String, String>— callers format errors for their API (OpenAI vs Anthropic)Design Decisions
Custom span targets instead of
error_span!: The tracing ecosystem has noalways_on_span!. We useinfo_span!(target: "request_span", ...)with arequest_span=tracedirective infilters()— same pattern as existingspan_event=trace. Overridable viaDYN_LOG=request_span=off.System vs inference separation:
service_v2.rsrouter split into two groups with separateTraceLayer. System endpoints (health, metrics, models) usedebug_span!(target: "system_span")— invisible atDYN_LOG=info, visible atDYN_LOG=debug.UUID validation returns
Result<String, String>: The error is a plain message string. OpenAI handlers convert viaErrorMessage::from_http_error(HttpError{...}), Anthropic viaanthropic_error(...). Each endpoint returns its native error format.Files Changed
lib/runtime/src/logging.rsmake_inference_request_span+make_system_request_span+request_span=tracedirective + worker span targetslib/llm/src/http/service/openai.rsget_or_create_request_id()simplified, all call sites updatedlib/llm/src/http/service/anthropic.rslib/llm/src/http/service/service_v2.rslib/runtime/src/system_status_server.rsmake_system_request_spanlib/runtime/src/.../push_endpoint.rsrequest_spantargetTest plan
cargo fmt+cargo clippycleanx-dynamo-request-id→ UUID auto-generated, visible in all logsDYN_LOG=info→ no system endpoint logs, inference logs visibleDYN_LOG=debug→ both visiblex_dynamo_request_idmatching frontend🤖 Generated with Claude Code
Linear: DIS-1643
Summary by CodeRabbit
Release Notes
Bug Fixes
400 Bad Requestresponses instead of being silently accepted.Improvements