feat: request lifecycle logging via InflightGuard by nnshah1 · Pull Request #7815 · ai-dynamo/dynamo

nnshah1 · 2026-04-02T15:44:02Z

Summary

Add request_id field to InflightGuard with structured "request received" (INFO) and "request completed" (INFO/ERROR) lifecycle logs
Add accessor methods and Display impls for RequestType/ErrorType
Change cancellation logging from trace\! to warn\! with structured fields in disconnect.rs
Add worker lifecycle logs ("request received"/"request completed") in push_handler.rs
All create_inflight_guard call sites now pass request_id

Test plan

cargo check --workspace passes
cargo clippy -p dynamo-llm -p dynamo-runtime --no-deps clean
CI checks pass

Part 2 of 4 for DIS-1643: Consistent Error Tracing
Depends on #7814

🤖 Generated with Claude Code

Summary by CodeRabbit

Improvements
- Enhanced request tracing with unique identifiers throughout the system for improved debugging and observability.
- Added structured logging for request lifecycle events (received, completed, cancelled) with detailed metadata.
- Improved visibility into request metrics with accessible data about models, endpoints, and timing information.
- Elevated disconnect and error logging to provide better visibility into connection and cancellation events.

coderabbitai · 2026-04-02T20:23:02Z

Walkthrough

The pull request extends request tracking and observability throughout the LLM service stack. Changes update the metrics infrastructure to store and expose request identifiers, add lifecycle logging (request received/completed), and thread request IDs through service handlers across HTTP and gRPC protocols to service endpoints including OpenAI, Anthropic, and Tensor. Disconnect monitoring logs are elevated from trace to warning level.

Changes

Cohort / File(s)	Summary
Metrics Infrastructure `lib/llm/src/http/service/metrics.rs`	Extended `InflightGuard` to store `request_id`; added public accessor methods (`request_id()`, `model()`, `endpoint()`, `request_type()`, `error_type()`, `elapsed_ms()`); introduced logging in constructor ("request received") and drop ("request completed") with structured metadata; added `Display` trait impls for `RequestType` and `ErrorType`; updated `ResponseMetricCollector` to include and store ISL.
HTTP Service Endpoints `lib/llm/src/http/service/openai.rs`, `lib/llm/src/http/service/anthropic.rs`	Updated `create_inflight_guard()` invocations in multiple handlers (`completions_single`, `completions_batch`, `chat_completions`, `embeddings`, `images`, `videos`, `audio_speech`, `anthropic_messages`) to pass request identifier (`request_id.clone()` or `ctx.id().to_string()`) as fourth argument; no control flow or error handling changes.
gRPC Service Endpoints `lib/llm/src/grpc/service/openai.rs`, `lib/llm/src/grpc/service/tensor.rs`	Updated `create_inflight_guard()` calls in `completion_response_stream` and `tensor_response_stream` to include `request_id.clone()` as additional parameter.
Disconnect Observability `lib/llm/src/http/service/disconnect.rs`	Upgraded disconnect-related logs from trace to warning level in `connection_monitor`; enhanced `monitor_for_disconnects` to emit structured warning log when request is cancelled by client, including metadata from `inflight_guard` (request_id, model, endpoint, request_type, elapsed_ms).
Runtime Logging `lib/runtime/src/pipeline/network/ingress/push_handler.rs`	Added info-level logging at request lifecycle boundaries in `handle_payload`: "request received" after decoding control message and "request completed" after successful processing.
Test Updates `lib/llm/tests/http_metrics.rs`	Updated test cases to pass empty `String` as new `request_id` parameter to `create_inflight_guard()` signature.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description covers the Summary and Test plan sections but lacks the required Details, Where should the reviewer start, and Related Issues sections from the template.	Add missing template sections: expand Details with specific implementation notes, identify key files for review (especially lib/llm/src/http/service/metrics.rs and lib/llm/src/http/service/disconnect.rs), and explicitly list related GitHub issue `#DIS-1643` using Relates to keyword.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding request lifecycle logging via InflightGuard, which aligns with the primary objective of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

lib/llm/src/http/service/disconnect.rs (1)

138-153: Add structured fields to unexpected-close warnings.

Lines 138 and 153 currently log plain strings; include request_id, model, endpoint, and request_type for consistent correlation with other lifecycle logs.

Proposed refactor

-            tracing::warn!("Connection closed unexpectedly; issuing cancellation");
+            tracing::warn!(
+                request_id = %engine_context.id(),
+                model = %cancellation_labels.model.as_str(),
+                endpoint = %cancellation_labels.endpoint.as_str(),
+                request_type = %cancellation_labels.request_type.as_str(),
+                "connection closed unexpectedly; issuing cancellation"
+            );
...
-            tracing::warn!("Stream closed unexpectedly; issuing cancellation");
+            tracing::warn!(
+                request_id = %engine_context.id(),
+                model = %cancellation_labels.model.as_str(),
+                endpoint = %cancellation_labels.endpoint.as_str(),
+                request_type = %cancellation_labels.request_type.as_str(),
+                "stream closed unexpectedly; issuing cancellation"
+            );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/disconnect.rs` around lines 138 - 153, Update the
two tracing::warn calls that log unexpected closures to include structured
fields for correlation: replace the plain string warnings in the
ConnectionStatus::ClosedUnexpectedly match arm (the tracing::warn! that follows
metrics.inc_client_disconnect()/metrics.inc_cancellation() and
engine_context.kill()) and the tracing::warn! inside the match handling
stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id,
model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id =
%request_id, model = %model, endpoint = %endpoint, request_type = %request_type,
"Connection closed unexpectedly; issuing cancellation") so the logs align with
other lifecycle logs and preserve the existing message and behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/http/service/metrics.rs`:
- Around line 1050-1052: The current mapping for ErrorType::Cancelled in the
metrics/error_detail logic only says "client disconnected before completion"
which mislabels timeout-driven cancellations; update the ErrorType::Cancelled
branch (where `let detail = match self.error_type { ... }` is defined) to use a
broader description such as "request cancelled or timed out before completion"
(or similar wording that covers both client disconnect and timeout) so lifecycle
logs correctly reflect both cancellation causes; leave other ErrorType variants
unchanged.

In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs`:
- Around line 366-369: The unconditional tracing::info!("request completed")
incorrectly logs success even for publish-failure/cancellation paths; change it
so completion is only logged on the successful completion path (e.g., after the
publish/send succeeds) and instead log failure/cancellation with appropriate
tracing::error or tracing::warn in the send-failure/cancel branches; locate the
tracing::info call in push_handler.rs (the spot using context.id()) and
move/remove it from the unconditional location and add targeted logs inside the
success branch and inside the send failure/cancellation branches so each outcome
is recorded accurately.

---

Nitpick comments:
In `@lib/llm/src/http/service/disconnect.rs`:
- Around line 138-153: Update the two tracing::warn calls that log unexpected
closures to include structured fields for correlation: replace the plain string
warnings in the ConnectionStatus::ClosedUnexpectedly match arm (the
tracing::warn! that follows
metrics.inc_client_disconnect()/metrics.inc_cancellation() and
engine_context.kill()) and the tracing::warn! inside the match handling
stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id,
model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id =
%request_id, model = %model, endpoint = %endpoint, request_type = %request_type,
"Connection closed unexpectedly; issuing cancellation") so the logs align with
other lifecycle logs and preserve the existing message and behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c711e7a-3a08-4006-8af1-56bda5c90952

📥 Commits

Reviewing files that changed from the base of the PR and between 13058cb and 215ab79.

📒 Files selected for processing (8)

lib/llm/src/grpc/service/openai.rs
lib/llm/src/grpc/service/tensor.rs
lib/llm/src/http/service/anthropic.rs
lib/llm/src/http/service/disconnect.rs
lib/llm/src/http/service/metrics.rs
lib/llm/src/http/service/openai.rs
lib/llm/tests/http_metrics.rs
lib/runtime/src/pipeline/network/ingress/push_handler.rs

coderabbitai · 2026-04-02T20:23:05Z

+                let detail = match self.error_type {
+                    ErrorType::Cancelled => "client disconnected before completion",
+                    ErrorType::Internal => "internal server error during processing",


⚠️ Potential issue | 🟡 Minor

Don't label every Cancelled request as a client disconnect.

ErrorType::Cancelled is documented as covering client cancellation and timeout cases, but the new error_detail only describes the client side. That will mislabel timeout paths in the lifecycle logs.

📝 Suggested wording

- ErrorType::Cancelled => "client disconnected before completion", + ErrorType::Cancelled => "request cancelled or timed out before completion",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let detail = match self.error_type {

ErrorType::Cancelled => "client disconnected before completion",

ErrorType::Internal => "internal server error during processing",

let detail = match self.error_type {

ErrorType::Cancelled => "request cancelled or timed out before completion",

ErrorType::Internal => "internal server error during processing",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/metrics.rs` around lines 1050 - 1052, The current mapping for ErrorType::Cancelled in the metrics/error_detail logic only says "client disconnected before completion" which mislabels timeout-driven cancellations; update the ErrorType::Cancelled branch (where `let detail = match self.error_type { ... }` is defined) to use a broader description such as "request cancelled or timed out before completion" (or similar wording that covers both client disconnect and timeout) so lifecycle logs correctly reflect both cancellation causes; leave other ErrorType variants unchanged.

coderabbitai · 2026-04-02T20:23:06Z

+        tracing::info!(
+            request_id = %context.id(),
+            "request completed"
+        );


⚠️ Potential issue | 🟠 Major

Avoid unconditional "request completed" logging.

At Line 366, this log is emitted even after publish-failure/cancellation paths (e.g., send failure branch), which can misreport failed/cancelled flows as successful completions.

Proposed fix

- tracing::info!( - request_id = %context.id(), - "request completed" - ); + let elapsed_ms = start_time.elapsed().as_millis(); + if send_complete_final { + tracing::info!( + request_id = %context.id(), + status = "success", + elapsed_ms = %elapsed_ms, + "request completed" + ); + } else if context.is_stopped() { + tracing::warn!( + request_id = %context.id(), + status = "cancelled", + elapsed_ms = %elapsed_ms, + "request completed" + ); + } else { + tracing::error!( + request_id = %context.id(), + status = "error", + elapsed_ms = %elapsed_ms, + "request completed" + ); + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs` around lines 366 - 369, The unconditional tracing::info!("request completed") incorrectly logs success even for publish-failure/cancellation paths; change it so completion is only logged on the successful completion path (e.g., after the publish/send succeeds) and instead log failure/cancellation with appropriate tracing::error or tracing::warn in the send-failure/cancel branches; locate the tracing::info call in push_handler.rs (the spot using context.id()) and move/remove it from the unconditional location and add targeted logs inside the success branch and inside the send failure/cancellation branches so each outcome is recorded accurately.

…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add "request received" (INFO) and "request completed" (INFO/ERROR) logs to InflightGuard with structured fields (request_id, model, endpoint, request_type, status, elapsed_ms). Add cancellation WARN in disconnect monitor. Add worker lifecycle logs in push_handler. All call sites now pass request_id to create_inflight_guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nnshah1 · 2026-04-02T23:02:22Z

Superseded by combined PR #7840

nnshah1 requested a review from a team as a code owner April 2, 2026 15:44

pull-request-size Bot added the size/L label Apr 2, 2026

github-actions Bot added the feat label Apr 2, 2026

nnshah1 mentioned this pull request Apr 2, 2026

feat: token counts, TTFT, ITL, worker IDs on span #7816

Closed

2 tasks

github-actions Bot added the frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` label Apr 2, 2026

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 2564a0d to 26a7f55 Compare April 2, 2026 16:41

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 16:41 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 26a7f55 to 9c449c6 Compare April 2, 2026 16:48

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 16:48 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 9c449c6 to 5a81d75 Compare April 2, 2026 17:09

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 17:09 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 5a81d75 to 613f3bd Compare April 2, 2026 17:17

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from d645665 to 8f3cfd4 Compare April 2, 2026 17:17

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 17:17 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 613f3bd to ea87d0a Compare April 2, 2026 17:20

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 8f3cfd4 to 6a1e883 Compare April 2, 2026 17:20

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 17:20 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 6a1e883 to 5307cb8 Compare April 2, 2026 17:26

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from ea87d0a to 26e51f1 Compare April 2, 2026 17:26

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 17:26 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 5307cb8 to f871386 Compare April 2, 2026 17:35

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 26e51f1 to de44836 Compare April 2, 2026 17:35

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 17:35 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from f871386 to ad8c70c Compare April 2, 2026 17:55

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from de44836 to c5695ff Compare April 2, 2026 19:12

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from ad8c70c to 789d06b Compare April 2, 2026 19:12

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 19:12 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from c5695ff to f8b283d Compare April 2, 2026 19:50

nnshah1 requested a review from a team as a code owner April 2, 2026 19:50

nnshah1 requested a review from a team April 2, 2026 19:50

nnshah1 changed the base branch from main to nnshah1/DIS-1643-pr1-id-propagation April 2, 2026 20:15

pull-request-size Bot added size/XXL and removed size/L labels Apr 2, 2026

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 215ab79 to fb3a884 Compare April 2, 2026 20:17

pull-request-size Bot added size/L and removed size/XXL labels Apr 2, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 20:17 Inactive

coderabbitai Bot reviewed Apr 2, 2026

View reviewed changes

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from fb3a884 to 5657e03 Compare April 2, 2026 20:36

nnshah1 changed the base branch from nnshah1/DIS-1643-pr1-id-propagation to nnshah1/DIS-1643-pr1-followup April 2, 2026 20:36

pull-request-size Bot added size/XXL and removed size/L labels Apr 2, 2026

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-followup branch from eb1df1f to 140696c Compare April 2, 2026 20:59

nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 5657e03 to 92676d5 Compare April 2, 2026 20:59

pull-request-size Bot added size/L and removed size/XXL labels Apr 2, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 2, 2026 20:59 Inactive

nnshah1 force-pushed the nnshah1/DIS-1643-pr1-followup branch 4 times, most recently from db13cf6 to 9a9e8ff Compare April 2, 2026 22:11

Base automatically changed from nnshah1/DIS-1643-pr1-followup to main April 2, 2026 23:00

pull-request-size Bot added size/XL and removed size/L labels Apr 2, 2026

nnshah1 closed this Apr 2, 2026

copy-pr-bot Bot had a problem deploying to GITLAB April 3, 2026 01:36 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: request lifecycle logging via InflightGuard#7815

feat: request lifecycle logging via InflightGuard#7815
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/DIS-1643-pr2-lifecycle-logging

nnshah1 commented Apr 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 2, 2026

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 2, 2026

Uh oh!

coderabbitai Bot Apr 2, 2026

Uh oh!

nnshah1 commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nnshah1 commented Apr 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 2, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

nnshah1 commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nnshah1 commented Apr 2, 2026 •

edited by coderabbitai Bot

Loading