Skip to content

feat: request lifecycle logging via InflightGuard#7815

Closed
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/DIS-1643-pr2-lifecycle-logging
Closed

feat: request lifecycle logging via InflightGuard#7815
nnshah1 wants to merge 2 commits into
mainfrom
nnshah1/DIS-1643-pr2-lifecycle-logging

Conversation

@nnshah1
Copy link
Copy Markdown
Contributor

@nnshah1 nnshah1 commented Apr 2, 2026

Summary

  • Add request_id field to InflightGuard with structured "request received" (INFO) and "request completed" (INFO/ERROR) lifecycle logs
  • Add accessor methods and Display impls for RequestType/ErrorType
  • Change cancellation logging from trace\! to warn\! with structured fields in disconnect.rs
  • Add worker lifecycle logs ("request received"/"request completed") in push_handler.rs
  • All create_inflight_guard call sites now pass request_id

Test plan

  • cargo check --workspace passes
  • cargo clippy -p dynamo-llm -p dynamo-runtime --no-deps clean
  • CI checks pass

Part 2 of 4 for DIS-1643: Consistent Error Tracing
Depends on #7814

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Improvements
    • Enhanced request tracing with unique identifiers throughout the system for improved debugging and observability.
    • Added structured logging for request lifecycle events (received, completed, cancelled) with detailed metadata.
    • Improved visibility into request metrics with accessible data about models, endpoints, and timing information.
    • Elevated disconnect and error logging to provide better visibility into connection and cancellation events.

@nnshah1 nnshah1 requested a review from a team as a code owner April 2, 2026 15:44
@github-actions github-actions Bot added the feat label Apr 2, 2026
@github-actions github-actions Bot added the frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` label Apr 2, 2026
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 2564a0d to 26a7f55 Compare April 2, 2026 16:41
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 26a7f55 to 9c449c6 Compare April 2, 2026 16:48
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 9c449c6 to 5a81d75 Compare April 2, 2026 17:09
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 5a81d75 to 613f3bd Compare April 2, 2026 17:17
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from d645665 to 8f3cfd4 Compare April 2, 2026 17:17
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 613f3bd to ea87d0a Compare April 2, 2026 17:20
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 8f3cfd4 to 6a1e883 Compare April 2, 2026 17:20
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 6a1e883 to 5307cb8 Compare April 2, 2026 17:26
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from ea87d0a to 26e51f1 Compare April 2, 2026 17:26
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from 5307cb8 to f871386 Compare April 2, 2026 17:35
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 26e51f1 to de44836 Compare April 2, 2026 17:35
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from f871386 to ad8c70c Compare April 2, 2026 17:55
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from de44836 to c5695ff Compare April 2, 2026 19:12
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-id-propagation branch from ad8c70c to 789d06b Compare April 2, 2026 19:12
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from c5695ff to f8b283d Compare April 2, 2026 19:50
@nnshah1 nnshah1 requested a review from a team as a code owner April 2, 2026 19:50
@nnshah1 nnshah1 requested a review from a team April 2, 2026 19:50
@nnshah1 nnshah1 changed the base branch from main to nnshah1/DIS-1643-pr1-id-propagation April 2, 2026 20:15
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 215ab79 to fb3a884 Compare April 2, 2026 20:17
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 2, 2026

Walkthrough

The pull request extends request tracking and observability throughout the LLM service stack. Changes update the metrics infrastructure to store and expose request identifiers, add lifecycle logging (request received/completed), and thread request IDs through service handlers across HTTP and gRPC protocols to service endpoints including OpenAI, Anthropic, and Tensor. Disconnect monitoring logs are elevated from trace to warning level.

Changes

Cohort / File(s) Summary
Metrics Infrastructure
lib/llm/src/http/service/metrics.rs
Extended InflightGuard to store request_id; added public accessor methods (request_id(), model(), endpoint(), request_type(), error_type(), elapsed_ms()); introduced logging in constructor ("request received") and drop ("request completed") with structured metadata; added Display trait impls for RequestType and ErrorType; updated ResponseMetricCollector to include and store ISL.
HTTP Service Endpoints
lib/llm/src/http/service/openai.rs, lib/llm/src/http/service/anthropic.rs
Updated create_inflight_guard() invocations in multiple handlers (completions_single, completions_batch, chat_completions, embeddings, images, videos, audio_speech, anthropic_messages) to pass request identifier (request_id.clone() or ctx.id().to_string()) as fourth argument; no control flow or error handling changes.
gRPC Service Endpoints
lib/llm/src/grpc/service/openai.rs, lib/llm/src/grpc/service/tensor.rs
Updated create_inflight_guard() calls in completion_response_stream and tensor_response_stream to include request_id.clone() as additional parameter.
Disconnect Observability
lib/llm/src/http/service/disconnect.rs
Upgraded disconnect-related logs from trace to warning level in connection_monitor; enhanced monitor_for_disconnects to emit structured warning log when request is cancelled by client, including metadata from inflight_guard (request_id, model, endpoint, request_type, elapsed_ms).
Runtime Logging
lib/runtime/src/pipeline/network/ingress/push_handler.rs
Added info-level logging at request lifecycle boundaries in handle_payload: "request received" after decoding control message and "request completed" after successful processing.
Test Updates
lib/llm/tests/http_metrics.rs
Updated test cases to pass empty String as new request_id parameter to create_inflight_guard() signature.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The PR description covers the Summary and Test plan sections but lacks the required Details, Where should the reviewer start, and Related Issues sections from the template. Add missing template sections: expand Details with specific implementation notes, identify key files for review (especially lib/llm/src/http/service/metrics.rs and lib/llm/src/http/service/disconnect.rs), and explicitly list related GitHub issue #DIS-1643 using Relates to keyword.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding request lifecycle logging via InflightGuard, which aligns with the primary objective of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
lib/llm/src/http/service/disconnect.rs (1)

138-153: Add structured fields to unexpected-close warnings.

Lines 138 and 153 currently log plain strings; include request_id, model, endpoint, and request_type for consistent correlation with other lifecycle logs.

Proposed refactor
-            tracing::warn!("Connection closed unexpectedly; issuing cancellation");
+            tracing::warn!(
+                request_id = %engine_context.id(),
+                model = %cancellation_labels.model.as_str(),
+                endpoint = %cancellation_labels.endpoint.as_str(),
+                request_type = %cancellation_labels.request_type.as_str(),
+                "connection closed unexpectedly; issuing cancellation"
+            );
...
-            tracing::warn!("Stream closed unexpectedly; issuing cancellation");
+            tracing::warn!(
+                request_id = %engine_context.id(),
+                model = %cancellation_labels.model.as_str(),
+                endpoint = %cancellation_labels.endpoint.as_str(),
+                request_type = %cancellation_labels.request_type.as_str(),
+                "stream closed unexpectedly; issuing cancellation"
+            );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/disconnect.rs` around lines 138 - 153, Update the
two tracing::warn calls that log unexpected closures to include structured
fields for correlation: replace the plain string warnings in the
ConnectionStatus::ClosedUnexpectedly match arm (the tracing::warn! that follows
metrics.inc_client_disconnect()/metrics.inc_cancellation() and
engine_context.kill()) and the tracing::warn! inside the match handling
stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id,
model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id =
%request_id, model = %model, endpoint = %endpoint, request_type = %request_type,
"Connection closed unexpectedly; issuing cancellation") so the logs align with
other lifecycle logs and preserve the existing message and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/http/service/metrics.rs`:
- Around line 1050-1052: The current mapping for ErrorType::Cancelled in the
metrics/error_detail logic only says "client disconnected before completion"
which mislabels timeout-driven cancellations; update the ErrorType::Cancelled
branch (where `let detail = match self.error_type { ... }` is defined) to use a
broader description such as "request cancelled or timed out before completion"
(or similar wording that covers both client disconnect and timeout) so lifecycle
logs correctly reflect both cancellation causes; leave other ErrorType variants
unchanged.

In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs`:
- Around line 366-369: The unconditional tracing::info!("request completed")
incorrectly logs success even for publish-failure/cancellation paths; change it
so completion is only logged on the successful completion path (e.g., after the
publish/send succeeds) and instead log failure/cancellation with appropriate
tracing::error or tracing::warn in the send-failure/cancel branches; locate the
tracing::info call in push_handler.rs (the spot using context.id()) and
move/remove it from the unconditional location and add targeted logs inside the
success branch and inside the send failure/cancellation branches so each outcome
is recorded accurately.

---

Nitpick comments:
In `@lib/llm/src/http/service/disconnect.rs`:
- Around line 138-153: Update the two tracing::warn calls that log unexpected
closures to include structured fields for correlation: replace the plain string
warnings in the ConnectionStatus::ClosedUnexpectedly match arm (the
tracing::warn! that follows
metrics.inc_client_disconnect()/metrics.inc_cancellation() and
engine_context.kill()) and the tracing::warn! inside the match handling
stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id,
model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id =
%request_id, model = %model, endpoint = %endpoint, request_type = %request_type,
"Connection closed unexpectedly; issuing cancellation") so the logs align with
other lifecycle logs and preserve the existing message and behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c711e7a-3a08-4006-8af1-56bda5c90952

📥 Commits

Reviewing files that changed from the base of the PR and between 13058cb and 215ab79.

📒 Files selected for processing (8)
  • lib/llm/src/grpc/service/openai.rs
  • lib/llm/src/grpc/service/tensor.rs
  • lib/llm/src/http/service/anthropic.rs
  • lib/llm/src/http/service/disconnect.rs
  • lib/llm/src/http/service/metrics.rs
  • lib/llm/src/http/service/openai.rs
  • lib/llm/tests/http_metrics.rs
  • lib/runtime/src/pipeline/network/ingress/push_handler.rs

Comment on lines +1050 to +1052
let detail = match self.error_type {
ErrorType::Cancelled => "client disconnected before completion",
ErrorType::Internal => "internal server error during processing",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Don't label every Cancelled request as a client disconnect.

ErrorType::Cancelled is documented as covering client cancellation and timeout cases, but the new error_detail only describes the client side. That will mislabel timeout paths in the lifecycle logs.

📝 Suggested wording
-                    ErrorType::Cancelled => "client disconnected before completion",
+                    ErrorType::Cancelled => "request cancelled or timed out before completion",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let detail = match self.error_type {
ErrorType::Cancelled => "client disconnected before completion",
ErrorType::Internal => "internal server error during processing",
let detail = match self.error_type {
ErrorType::Cancelled => "request cancelled or timed out before completion",
ErrorType::Internal => "internal server error during processing",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/llm/src/http/service/metrics.rs` around lines 1050 - 1052, The current
mapping for ErrorType::Cancelled in the metrics/error_detail logic only says
"client disconnected before completion" which mislabels timeout-driven
cancellations; update the ErrorType::Cancelled branch (where `let detail = match
self.error_type { ... }` is defined) to use a broader description such as
"request cancelled or timed out before completion" (or similar wording that
covers both client disconnect and timeout) so lifecycle logs correctly reflect
both cancellation causes; leave other ErrorType variants unchanged.

Comment on lines +366 to +369
tracing::info!(
request_id = %context.id(),
"request completed"
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid unconditional "request completed" logging.

At Line 366, this log is emitted even after publish-failure/cancellation paths (e.g., send failure branch), which can misreport failed/cancelled flows as successful completions.

Proposed fix
-        tracing::info!(
-            request_id = %context.id(),
-            "request completed"
-        );
+        let elapsed_ms = start_time.elapsed().as_millis();
+        if send_complete_final {
+            tracing::info!(
+                request_id = %context.id(),
+                status = "success",
+                elapsed_ms = %elapsed_ms,
+                "request completed"
+            );
+        } else if context.is_stopped() {
+            tracing::warn!(
+                request_id = %context.id(),
+                status = "cancelled",
+                elapsed_ms = %elapsed_ms,
+                "request completed"
+            );
+        } else {
+            tracing::error!(
+                request_id = %context.id(),
+                status = "error",
+                elapsed_ms = %elapsed_ms,
+                "request completed"
+            );
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs` around lines 366 -
369, The unconditional tracing::info!("request completed") incorrectly logs
success even for publish-failure/cancellation paths; change it so completion is
only logged on the successful completion path (e.g., after the publish/send
succeeds) and instead log failure/cancellation with appropriate tracing::error
or tracing::warn in the send-failure/cancel branches; locate the tracing::info
call in push_handler.rs (the spot using context.id()) and move/remove it from
the unconditional location and add targeted logs inside the success branch and
inside the send failure/cancellation branches so each outcome is recorded
accurately.

@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from fb3a884 to 5657e03 Compare April 2, 2026 20:36
@nnshah1 nnshah1 changed the base branch from nnshah1/DIS-1643-pr1-id-propagation to nnshah1/DIS-1643-pr1-followup April 2, 2026 20:36
…namo-request-id

Follow-up to #7733. Changes missed from the clean rebuild:
- Unify on_response callback (error for 4xx/5xx, info for success) for
  both system and inference routes
- Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id
- Rename internal propagation header from x-dynamo-request-id to request-id
  (with backward compat fallback)
- Add UUID validation on TCP header path
- get_or_create_request_id returns String (warns on invalid UUID instead of 400)
- Add deprecation warning (DEP #7812) for x-dynamo-request-id header
- Add echo_request_id_header middleware
- make_system_request_span preserves trace context + generates request_id
- Remove duplicate x_dynamo_request_id span field from client_request spans

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-followup branch from eb1df1f to 140696c Compare April 2, 2026 20:59
Add "request received" (INFO) and "request completed" (INFO/ERROR) logs
to InflightGuard with structured fields (request_id, model, endpoint,
request_type, status, elapsed_ms). Add cancellation WARN in disconnect
monitor. Add worker lifecycle logs in push_handler. All call sites now
pass request_id to create_inflight_guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr2-lifecycle-logging branch from 5657e03 to 92676d5 Compare April 2, 2026 20:59
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr1-followup branch 4 times, most recently from db13cf6 to 9a9e8ff Compare April 2, 2026 22:11
Base automatically changed from nnshah1/DIS-1643-pr1-followup to main April 2, 2026 23:00
@pull-request-size pull-request-size Bot added size/XL and removed size/L labels Apr 2, 2026
@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Apr 2, 2026

Superseded by combined PR #7840

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant