feat: request lifecycle logging via InflightGuard#7815
Conversation
2564a0d to
26a7f55
Compare
26a7f55 to
9c449c6
Compare
9c449c6 to
5a81d75
Compare
5a81d75 to
613f3bd
Compare
d645665 to
8f3cfd4
Compare
613f3bd to
ea87d0a
Compare
8f3cfd4 to
6a1e883
Compare
6a1e883 to
5307cb8
Compare
ea87d0a to
26e51f1
Compare
5307cb8 to
f871386
Compare
26e51f1 to
de44836
Compare
f871386 to
ad8c70c
Compare
de44836 to
c5695ff
Compare
ad8c70c to
789d06b
Compare
c5695ff to
f8b283d
Compare
215ab79 to
fb3a884
Compare
WalkthroughThe pull request extends request tracking and observability throughout the LLM service stack. Changes update the metrics infrastructure to store and expose request identifiers, add lifecycle logging (request received/completed), and thread request IDs through service handlers across HTTP and gRPC protocols to service endpoints including OpenAI, Anthropic, and Tensor. Disconnect monitoring logs are elevated from trace to warning level. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
lib/llm/src/http/service/disconnect.rs (1)
138-153: Add structured fields to unexpected-close warnings.Lines 138 and 153 currently log plain strings; include
request_id,model,endpoint, andrequest_typefor consistent correlation with other lifecycle logs.Proposed refactor
- tracing::warn!("Connection closed unexpectedly; issuing cancellation"); + tracing::warn!( + request_id = %engine_context.id(), + model = %cancellation_labels.model.as_str(), + endpoint = %cancellation_labels.endpoint.as_str(), + request_type = %cancellation_labels.request_type.as_str(), + "connection closed unexpectedly; issuing cancellation" + ); ... - tracing::warn!("Stream closed unexpectedly; issuing cancellation"); + tracing::warn!( + request_id = %engine_context.id(), + model = %cancellation_labels.model.as_str(), + endpoint = %cancellation_labels.endpoint.as_str(), + request_type = %cancellation_labels.request_type.as_str(), + "stream closed unexpectedly; issuing cancellation" + );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/http/service/disconnect.rs` around lines 138 - 153, Update the two tracing::warn calls that log unexpected closures to include structured fields for correlation: replace the plain string warnings in the ConnectionStatus::ClosedUnexpectedly match arm (the tracing::warn! that follows metrics.inc_client_disconnect()/metrics.inc_cancellation() and engine_context.kill()) and the tracing::warn! inside the match handling stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id, model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id = %request_id, model = %model, endpoint = %endpoint, request_type = %request_type, "Connection closed unexpectedly; issuing cancellation") so the logs align with other lifecycle logs and preserve the existing message and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/http/service/metrics.rs`:
- Around line 1050-1052: The current mapping for ErrorType::Cancelled in the
metrics/error_detail logic only says "client disconnected before completion"
which mislabels timeout-driven cancellations; update the ErrorType::Cancelled
branch (where `let detail = match self.error_type { ... }` is defined) to use a
broader description such as "request cancelled or timed out before completion"
(or similar wording that covers both client disconnect and timeout) so lifecycle
logs correctly reflect both cancellation causes; leave other ErrorType variants
unchanged.
In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs`:
- Around line 366-369: The unconditional tracing::info!("request completed")
incorrectly logs success even for publish-failure/cancellation paths; change it
so completion is only logged on the successful completion path (e.g., after the
publish/send succeeds) and instead log failure/cancellation with appropriate
tracing::error or tracing::warn in the send-failure/cancel branches; locate the
tracing::info call in push_handler.rs (the spot using context.id()) and
move/remove it from the unconditional location and add targeted logs inside the
success branch and inside the send failure/cancellation branches so each outcome
is recorded accurately.
---
Nitpick comments:
In `@lib/llm/src/http/service/disconnect.rs`:
- Around line 138-153: Update the two tracing::warn calls that log unexpected
closures to include structured fields for correlation: replace the plain string
warnings in the ConnectionStatus::ClosedUnexpectedly match arm (the
tracing::warn! that follows
metrics.inc_client_disconnect()/metrics.inc_cancellation() and
engine_context.kill()) and the tracing::warn! inside the match handling
stream_rx.await | Ok(ConnectionStatus::ClosedUnexpectedly) to pass request_id,
model, endpoint, and request_type as fields (e.g., tracing::warn!(request_id =
%request_id, model = %model, endpoint = %endpoint, request_type = %request_type,
"Connection closed unexpectedly; issuing cancellation") so the logs align with
other lifecycle logs and preserve the existing message and behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5c711e7a-3a08-4006-8af1-56bda5c90952
📒 Files selected for processing (8)
lib/llm/src/grpc/service/openai.rslib/llm/src/grpc/service/tensor.rslib/llm/src/http/service/anthropic.rslib/llm/src/http/service/disconnect.rslib/llm/src/http/service/metrics.rslib/llm/src/http/service/openai.rslib/llm/tests/http_metrics.rslib/runtime/src/pipeline/network/ingress/push_handler.rs
| let detail = match self.error_type { | ||
| ErrorType::Cancelled => "client disconnected before completion", | ||
| ErrorType::Internal => "internal server error during processing", |
There was a problem hiding this comment.
Don't label every Cancelled request as a client disconnect.
ErrorType::Cancelled is documented as covering client cancellation and timeout cases, but the new error_detail only describes the client side. That will mislabel timeout paths in the lifecycle logs.
📝 Suggested wording
- ErrorType::Cancelled => "client disconnected before completion",
+ ErrorType::Cancelled => "request cancelled or timed out before completion",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let detail = match self.error_type { | |
| ErrorType::Cancelled => "client disconnected before completion", | |
| ErrorType::Internal => "internal server error during processing", | |
| let detail = match self.error_type { | |
| ErrorType::Cancelled => "request cancelled or timed out before completion", | |
| ErrorType::Internal => "internal server error during processing", |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/llm/src/http/service/metrics.rs` around lines 1050 - 1052, The current
mapping for ErrorType::Cancelled in the metrics/error_detail logic only says
"client disconnected before completion" which mislabels timeout-driven
cancellations; update the ErrorType::Cancelled branch (where `let detail = match
self.error_type { ... }` is defined) to use a broader description such as
"request cancelled or timed out before completion" (or similar wording that
covers both client disconnect and timeout) so lifecycle logs correctly reflect
both cancellation causes; leave other ErrorType variants unchanged.
| tracing::info!( | ||
| request_id = %context.id(), | ||
| "request completed" | ||
| ); |
There was a problem hiding this comment.
Avoid unconditional "request completed" logging.
At Line 366, this log is emitted even after publish-failure/cancellation paths (e.g., send failure branch), which can misreport failed/cancelled flows as successful completions.
Proposed fix
- tracing::info!(
- request_id = %context.id(),
- "request completed"
- );
+ let elapsed_ms = start_time.elapsed().as_millis();
+ if send_complete_final {
+ tracing::info!(
+ request_id = %context.id(),
+ status = "success",
+ elapsed_ms = %elapsed_ms,
+ "request completed"
+ );
+ } else if context.is_stopped() {
+ tracing::warn!(
+ request_id = %context.id(),
+ status = "cancelled",
+ elapsed_ms = %elapsed_ms,
+ "request completed"
+ );
+ } else {
+ tracing::error!(
+ request_id = %context.id(),
+ status = "error",
+ elapsed_ms = %elapsed_ms,
+ "request completed"
+ );
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/runtime/src/pipeline/network/ingress/push_handler.rs` around lines 366 -
369, The unconditional tracing::info!("request completed") incorrectly logs
success even for publish-failure/cancellation paths; change it so completion is
only logged on the successful completion path (e.g., after the publish/send
succeeds) and instead log failure/cancellation with appropriate tracing::error
or tracing::warn in the send-failure/cancel branches; locate the tracing::info
call in push_handler.rs (the spot using context.id()) and move/remove it from
the unconditional location and add targeted logs inside the success branch and
inside the send failure/cancellation branches so each outcome is recorded
accurately.
fb3a884 to
5657e03
Compare
…namo-request-id Follow-up to #7733. Changes missed from the clean rebuild: - Unify on_response callback (error for 4xx/5xx, info for success) for both system and inference routes - Rename TraceParent/DistributedTraceContext field x_dynamo_request_id → request_id - Rename internal propagation header from x-dynamo-request-id to request-id (with backward compat fallback) - Add UUID validation on TCP header path - get_or_create_request_id returns String (warns on invalid UUID instead of 400) - Add deprecation warning (DEP #7812) for x-dynamo-request-id header - Add echo_request_id_header middleware - make_system_request_span preserves trace context + generates request_id - Remove duplicate x_dynamo_request_id span field from client_request spans Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eb1df1f to
140696c
Compare
Add "request received" (INFO) and "request completed" (INFO/ERROR) logs to InflightGuard with structured fields (request_id, model, endpoint, request_type, status, elapsed_ms). Add cancellation WARN in disconnect monitor. Add worker lifecycle logs in push_handler. All call sites now pass request_id to create_inflight_guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5657e03 to
92676d5
Compare
db13cf6 to
9a9e8ff
Compare
|
Superseded by combined PR #7840 |
Summary
request_idfield toInflightGuardwith structured "request received" (INFO) and "request completed" (INFO/ERROR) lifecycle logsDisplayimpls forRequestType/ErrorTypetrace\!towarn\!with structured fields indisconnect.rspush_handler.rscreate_inflight_guardcall sites now passrequest_idTest plan
cargo check --workspacepassescargo clippy -p dynamo-llm -p dynamo-runtime --no-depscleanPart 2 of 4 for DIS-1643: Consistent Error Tracing
Depends on #7814
🤖 Generated with Claude Code
Summary by CodeRabbit