Skip to content

test: e2e JSONL log verification for request tracing [DIS-1643]#7766

Closed
nnshah1 wants to merge 1 commit into
nnshah1/DIS-1643-pr3-token-counts-cancellationfrom
nnshah1/DIS-1643-pr4-e2e-tests
Closed

test: e2e JSONL log verification for request tracing [DIS-1643]#7766
nnshah1 wants to merge 1 commit into
nnshah1/DIS-1643-pr3-token-counts-cancellationfrom
nnshah1/DIS-1643-pr4-e2e-tests

Conversation

@nnshah1
Copy link
Copy Markdown
Contributor

@nnshah1 nnshah1 commented Apr 1, 2026

Summary

11 parallel-safe e2e tests verifying JSONL log output for all request tracing scenarios. Tests run in 26s with -n auto.

Depends on: #7735

Tests

# Test Verifies
1 test_agg_unary_success INFO lifecycle: received → completed (success) → http sent (200)
2 test_agg_streaming_success Streaming: http sent at start, completed at end with tokens
3 test_agg_404_error ERROR completed (not_found) + ERROR http sent (404)
4 test_agg_400_invalid_uuid ERROR http sent (400), no request received
5 test_agg_request_id_propagation Same trace_id across frontend and worker
6 test_agg_cancellation Client disconnect → ERROR completed (cancelled)
7 test_agg_worker_crash Worker killed → ERROR completed (internal)
8 test_disagg_streaming_success Both prefill and decode workers log lifecycle
9 test_disagg_unary_success Both workers process request
10 test_disagg_prefill_crash Prefill killed → ERROR completed (internal, 500)
11 test_disagg_decode_crash Decode killed → ERROR completed (internal)

Example Log Output

Before (no structured request tracing)

WARN  request completed with client request error  status=400  latency_ms=0
ERROR Internal server error: Model not found

No request_id, no model, no trace correlation, no token counts.

After — Streaming Success

{"level":"INFO","message":"request received","request_id":"284f18c7-...","model":"qwen/qwen3-0.6b","endpoint":"chat_completions","request_type":"stream"}
{"level":"INFO","message":"http response sent","status":"200","latency_ms":"4"}
{"level":"INFO","message":"request completed","status":"success","elapsed_ms":"14","input_tokens":"9","output_tokens":"50","ttft_ms":"5.85","avg_itl_ms":"0.29"}

After — 404 Error

{"level":"INFO","message":"request received","request_id":"4644979b-...","model":"nonexistent-model","endpoint":"chat_completions"}
{"level":"ERROR","message":"request completed","status":"error","error_type":"not_found","error_detail":"model or resource not found","elapsed_ms":"0"}
{"level":"ERROR","message":"http response sent","status":"404","latency_ms":"0"}

After — Worker Crash (decode)

{"level":"INFO","message":"request received","request_id":"1d555fa1-...","model":"qwen/qwen3-0.6b","request_type":"stream"}
{"level":"ERROR","message":"request completed","status":"error","error_type":"internal","error_detail":"internal server error during processing","elapsed_ms":"556","input_tokens":"9","output_tokens":"4"}

After — Cancellation

{"level":"INFO","message":"request received","request_id":"8677080b-...","request_type":"stream"}
{"level":"ERROR","message":"request completed","status":"error","error_type":"cancelled","error_detail":"client disconnected before completion","elapsed_ms":"230"}
{"level":"WARN","message":"Stream closed unexpectedly; issuing cancellation"}

After — Worker (trace_id correlation)

{"level":"INFO","message":"request received","request_id":"6ec6ddd9-...","component":"backend","endpoint":"generate","instance_id":"2221573453717914121","trace_id":"3fe59d206530983e7fce5d405048f7c3"}
{"level":"INFO","message":"request completed","request_id":"6ec6ddd9-...","trace_id":"3fe59d206530983e7fce5d405048f7c3"}

🤖 Generated with Claude Code

Linear: DIS-1643

Summary by CodeRabbit

  • Tests
    • Added comprehensive request tracing lifecycle tests validating JSONL structured logging, trace correlation between frontend and worker services, error response handling, connection cancellation, worker failure scenarios, and both aggregated and disaggregated deployment configurations.

@nnshah1 nnshah1 requested a review from a team as a code owner April 1, 2026 14:05
@nnshah1 nnshah1 requested a review from a team April 1, 2026 14:05
@github-actions github-actions Bot added the test label Apr 1, 2026
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr4-e2e-tests branch from f4e135a to f5e0f17 Compare April 1, 2026 18:24
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr3-token-counts-cancellation branch from 9780c18 to ac29172 Compare April 1, 2026 18:24
@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Apr 1, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 1, 2026

Walkthrough

A new end-to-end test module test_request_tracing_logs.py is introduced to validate JSONL structured logging and request tracing lifecycle fields. It provides fixtures for both aggregated and disaggregated worker modes and includes comprehensive test coverage for success paths, error scenarios, cancellation handling, and worker crash failures.

Changes

Cohort / File(s) Summary
End-to-End Request Tracing Tests
tests/frontend/test_request_tracing_logs.py
New test module with JsonlMockerWorkerProcess subclass, JSONL log parsing helper functions (parse_jsonl_logs, find_logs_by_request_id, read_log_file), fixtures for aggregated and disaggregated frontend+worker setups, and comprehensive tests covering unary/streaming success, HTTP error responses (404, 400), trace correlation, connection cancellation, and worker crash scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding end-to-end JSONL log verification tests for request tracing, with the associated ticket reference.
Description check ✅ Passed The description comprehensively covers the PR with a clear summary, detailed test matrix, example log outputs demonstrating before/after improvements, and proper issue/ticket linkage.
Docstring Coverage ✅ Passed Docstring coverage is 86.96% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/frontend/test_request_tracing_logs.py (2)

78-84: Prefer a polling log helper over snapshot reads.

The tests below read live child-process logs through this helper after fixed sleep(1/2/3) delays. That makes the suite racey on slower xdist runners and always adds wall time on fast ones. A bounded wait_for_log_entries(..., timeout) helper here would make the assertions much more stable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_request_tracing_logs.py` around lines 78 - 84, The
current read_log_file helper returns a static snapshot and tests use fixed sleep
delays, which leads to flaky timing; replace or augment read_log_file/usage with
a polling helper (e.g., wait_for_log_entries(process, predicate, timeout)) that
repeatedly reads process.log_path (via process.log_path and os.path.exists) in a
short loop with small sleeps until the predicate on the log content succeeds or
timeout elapses; implement the helper to return the log content or raise/return
None on timeout and update tests to call wait_for_log_entries instead of
sleeping so assertions wait deterministically and bounded by the timeout.

152-289: Collapse the tracing service bootstrap into one factory/parametrized fixture.

tracing_services* and tracing_services_disagg* repeat the same JSONL env setup, frontend startup, readiness wait, and yielded shape. That duplication makes future tracing/logging changes easy to fix in one path and miss in another.

As per coding guidelines, "Do not copy-paste test infrastructure; reuse and refactor shared setup logic into fixtures or tests/utils/, and use @pytest.mark.parametrize with indirect fixtures instead of separate test functions."

Also applies to: 682-733

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_request_tracing_logs.py` around lines 152 - 289,
Consolidate the three fixtures (tracing_services, tracing_services_slow,
tracing_services_disagg) into one parametrized factory fixture (e.g.,
tracing_services) that accepts modes like "normal", "slow", and "disagg" via
pytest.mark.parametrize indirect; move the shared jsonl_env creation,
DynamoFrontendProcess startup, wait_for_http_completions_ready(TEST_MODEL), and
common yield shape into that single fixture, and branch on the mode to set
JsonlMockerWorkerProcess args (speedup_ratio=0.1 for "slow"; extra_args and two
workers + allocate_port/deallocate_port for "disagg" using
allocate_port/deallocate_port and worker_id "prefill-worker"/"decode-worker");
ensure the fixture yields the same keys ("frontend_port","frontend","worker")
for normal/slow and
("frontend_port","frontend","prefill_worker","decode_worker") for disagg so
existing tests keep working.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/frontend/test_request_tracing_logs.py`:
- Line 587: Import threading at the top-level module imports instead of inside
individual test functions; remove the local "import threading" statements from
the three test functions (the ones around lines 587, 738, and 807) so the module
uses a single module-level "threading" import and avoid redundant local imports.
- Around line 473-494: The try/except block around the streaming request should
not swallow all exceptions; replace the broad "except Exception: pass" with
explicit handling for the expected streaming-disconnect errors (e.g.,
requests.exceptions.ChunkedEncodingError, requests.exceptions.ReadTimeout,
requests.exceptions.ConnectionError, BrokenPipeError/OSError) so unrelated
failures surface; locate the call to requests.post(...) and the loop using
resp.iter_lines() in test_request_tracing_logs.py and update the except clause
to only catch those specific exceptions and otherwise let errors propagate.
- Around line 36-42: pytestmark in tests/frontend/test_request_tracing_logs.py
currently lacks a timeout for long-running E2E operations; add a module-level
timeout marker (e.g., pytest.mark.timeout(180)) to the existing pytestmark list
so the test process/HTTP/polling loops cannot hang indefinitely; update the
pytestmark variable (the pytestmark list defined near the top of the file) to
include pytest.mark.timeout(...) alongside the existing markers.

---

Nitpick comments:
In `@tests/frontend/test_request_tracing_logs.py`:
- Around line 78-84: The current read_log_file helper returns a static snapshot
and tests use fixed sleep delays, which leads to flaky timing; replace or
augment read_log_file/usage with a polling helper (e.g.,
wait_for_log_entries(process, predicate, timeout)) that repeatedly reads
process.log_path (via process.log_path and os.path.exists) in a short loop with
small sleeps until the predicate on the log content succeeds or timeout elapses;
implement the helper to return the log content or raise/return None on timeout
and update tests to call wait_for_log_entries instead of sleeping so assertions
wait deterministically and bounded by the timeout.
- Around line 152-289: Consolidate the three fixtures (tracing_services,
tracing_services_slow, tracing_services_disagg) into one parametrized factory
fixture (e.g., tracing_services) that accepts modes like "normal", "slow", and
"disagg" via pytest.mark.parametrize indirect; move the shared jsonl_env
creation, DynamoFrontendProcess startup,
wait_for_http_completions_ready(TEST_MODEL), and common yield shape into that
single fixture, and branch on the mode to set JsonlMockerWorkerProcess args
(speedup_ratio=0.1 for "slow"; extra_args and two workers +
allocate_port/deallocate_port for "disagg" using allocate_port/deallocate_port
and worker_id "prefill-worker"/"decode-worker"); ensure the fixture yields the
same keys ("frontend_port","frontend","worker") for normal/slow and
("frontend_port","frontend","prefill_worker","decode_worker") for disagg so
existing tests keep working.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8a7e482-e25b-4c2d-a474-31c23a8686f8

📥 Commits

Reviewing files that changed from the base of the PR and between ac29172 and f5e0f17.

📒 Files selected for processing (1)
  • tests/frontend/test_request_tracing_logs.py

Comment on lines +36 to +42
pytestmark = [
pytest.mark.e2e,
pytest.mark.gpu_0,
pytest.mark.post_merge,
pytest.mark.parallel,
pytest.mark.model(TEST_MODEL),
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get its size
wc -l tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 107


🏁 Script executed:

# Check the pytestmark definition and surrounding context
head -100 tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 3116


🏁 Script executed:

# Search for all pytest markers in the file, especially timeout
rg "pytest\.mark\." tests/frontend/test_request_tracing_logs.py -n

Repository: ai-dynamo/dynamo

Length of output: 207


🏁 Script executed:

# Search for time.sleep, subprocess, and other long operations
rg "time\.sleep|subprocess|\.Popen|\.request\(" tests/frontend/test_request_tracing_logs.py -n

Repository: ai-dynamo/dynamo

Length of output: 438


🏁 Script executed:

# Check individual test functions and their decorators/markers
rg "@pytest\.mark\.(timeout|pre_merge|post_merge|nightly)" tests/frontend/test_request_tracing_logs.py -B 5 -A 1 | head -150

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Get a list of all test function definitions
rg "^def test_" tests/frontend/test_request_tracing_logs.py -n

Repository: ai-dynamo/dynamo

Length of output: 770


🏁 Script executed:

# Check context around subprocess/service setup to understand test scope and duration
sed -n '150,300p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 5200


🏁 Script executed:

# Check if any individual tests have timeout markers
grep -n "@pytest.mark.timeout" tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Get actual test durations or estimates from fixture setup and operations
sed -n '290,350p' tests/frontend/test_request_tracing_logs.py | head -80

Repository: ai-dynamo/dynamo

Length of output: 2611


Add @pytest.mark.timeout(...) to cover E2E tests with subprocess and polling operations.

Module-level pytestmark includes scheduling/GPU/type markers but lacks a timeout marker. These tests spawn frontend/worker processes, make HTTP requests, and use time.sleep() for polling. Without timeouts, a wedged subprocess could hang the test suite indefinitely. Per guidelines, tests with long operations (network calls, subprocess waits, polling loops) must have explicit pytest.mark.timeout(...) set to 3× measured average (or conservatively 120–180 seconds for this E2E suite).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_request_tracing_logs.py` around lines 36 - 42, pytestmark
in tests/frontend/test_request_tracing_logs.py currently lacks a timeout for
long-running E2E operations; add a module-level timeout marker (e.g.,
pytest.mark.timeout(180)) to the existing pytestmark list so the test
process/HTTP/polling loops cannot hang indefinitely; update the pytestmark
variable (the pytestmark list defined near the top of the file) to include
pytest.mark.timeout(...) alongside the existing markers.

Comment on lines +473 to +494
try:
resp = requests.post(
f"http://localhost:{port}/v1/chat/completions",
headers={
"Content-Type": "application/json",
"x-dynamo-request-id": rid,
},
json={
"model": TEST_MODEL,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 2000,
"stream": True,
},
stream=True, # Don't download body eagerly
timeout=10,
)
# Read just enough to confirm stream started, then close
for _ in resp.iter_lines():
break # read one line then stop
resp.close() # Force TCP connection close
except Exception:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if file exists and read the specified lines
if [ -f "tests/frontend/test_request_tracing_logs.py" ]; then
  echo "=== Lines 473-494 ==="
  sed -n '473,494p' tests/frontend/test_request_tracing_logs.py
  echo ""
  echo "=== Lines 756-775 ==="
  sed -n '756,775p' tests/frontend/test_request_tracing_logs.py
else
  echo "File not found"
fi

Repository: ai-dynamo/dynamo

Length of output: 1427


🏁 Script executed:

# Search for other crash-related tests or exception handling patterns in the file
echo "=== Searching for 'crash' or 'except' patterns ==="
grep -n "crash\|except.*Exception\|except.*requests\." tests/frontend/test_request_tracing_logs.py | head -30

Repository: ai-dynamo/dynamo

Length of output: 1191


🏁 Script executed:

# Check overall file structure to understand context
wc -l tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 107


🏁 Script executed:

# Get function context for lines 473-494
echo "=== Context around lines 473-494 (function name) ==="
sed -n '450,500p' tests/frontend/test_request_tracing_logs.py | head -60

echo ""
echo "=== Context around lines 756-775 (function name) ==="
sed -n '730,780p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 3732


🏁 Script executed:

# Check what these test functions are named and their purpose
grep -B 5 "resp = requests.post" tests/frontend/test_request_tracing_logs.py | grep -E "^[0-9]+-def test_|^[0-9]+-.*except"

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Read the test_agg_worker_crash function to see what it expects
sed -n '581,630p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 1961


🏁 Script executed:

# Read test_disagg_decode_crash to see the exception pattern
sed -n '805,850p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 1662


Catch only the expected streaming failures here.

except Exception: pass will swallow unrelated request/setup regressions and still let the test continue as if the intended disconnect/crash path happened. Mirror the explicit exception handling already used in other crash tests (test_agg_worker_crash, test_disagg_decode_crash) and let unexpected failures fail fast.

Minimal tightening
-    except Exception:
-        pass
+    except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError):
+        pass
🧰 Tools
🪛 Ruff (0.15.7)

[error] 493-494: try-except-pass detected, consider logging the exception

(S110)


[warning] 493-493: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_request_tracing_logs.py` around lines 473 - 494, The
try/except block around the streaming request should not swallow all exceptions;
replace the broad "except Exception: pass" with explicit handling for the
expected streaming-disconnect errors (e.g.,
requests.exceptions.ChunkedEncodingError, requests.exceptions.ReadTimeout,
requests.exceptions.ConnectionError, BrokenPipeError/OSError) so unrelated
failures surface; locate the call to requests.post(...) and the loop using
resp.iter_lines() in test_request_tracing_logs.py and update the except clause
to only catch those specific exceptions and otherwise let errors propagate.

worker = tracing_services_slow["worker"]
rid = str(uuid.uuid4())

import threading
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file structure and locate the threading imports
head -100 tests/frontend/test_request_tracing_logs.py | grep -n "import\|from" | head -20

Repository: ai-dynamo/dynamo

Length of output: 612


🏁 Script executed:

# Now let's check the specific lines mentioned: 587, 738, 807
sed -n '585,590p' tests/frontend/test_request_tracing_logs.py
sed -n '736,741p' tests/frontend/test_request_tracing_logs.py
sed -n '805,810p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 796


🏁 Script executed:

# Check if threading is already imported at the module level
grep -n "^import threading\|^from threading" tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 42


🏁 Script executed:

# Let's also get context around those lines to understand which function they belong to
sed -n '580,595p' tests/frontend/test_request_tracing_logs.py
sed -n '730,745p' tests/frontend/test_request_tracing_logs.py
sed -n '800,815p' tests/frontend/test_request_tracing_logs.py

Repository: ai-dynamo/dynamo

Length of output: 1830


Move threading to the module import block.

threading is imported inside three test function bodies (lines 587, 738, 807) rather than at the module level. Per coding guidelines, all imports must be at the top of the file. Move it once to the module-level imports to avoid redundant local imports and maintain consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/frontend/test_request_tracing_logs.py` at line 587, Import threading at
the top-level module imports instead of inside individual test functions; remove
the local "import threading" statements from the three test functions (the ones
around lines 587, 738, and 807) so the module uses a single module-level
"threading" import and avoid redundant local imports.

@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr3-token-counts-cancellation branch from ac29172 to 3ca1685 Compare April 1, 2026 19:44
@nnshah1 nnshah1 requested a review from a team as a code owner April 1, 2026 19:44
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr3-token-counts-cancellation branch from 3ca1685 to 42c8201 Compare April 1, 2026 19:47
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr4-e2e-tests branch from f5e0f17 to c898496 Compare April 1, 2026 19:47
@nnshah1 nnshah1 force-pushed the nnshah1/DIS-1643-pr3-token-counts-cancellation branch from 42c8201 to 0943ded Compare April 1, 2026 21:00
11 parallel-safe tests (25s with -n auto) covering:
- Aggregated: unary success, streaming success, 404, 400 invalid UUID,
  cancellation, worker crash, request ID propagation
- Disaggregated: streaming success, unary success, prefill crash, decode crash

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Apr 6, 2026

Superseded by #7817 (rebased on current main with all DIS-1643 changes merged).

@nnshah1 nnshah1 closed this Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant