test: e2e JSONL log verification for request tracing [DIS-1643]#7766
test: e2e JSONL log verification for request tracing [DIS-1643]#7766nnshah1 wants to merge 1 commit into
Conversation
f4e135a to
f5e0f17
Compare
9780c18 to
ac29172
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
WalkthroughA new end-to-end test module Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/frontend/test_request_tracing_logs.py (2)
78-84: Prefer a polling log helper over snapshot reads.The tests below read live child-process logs through this helper after fixed
sleep(1/2/3)delays. That makes the suite racey on slower xdist runners and always adds wall time on fast ones. A boundedwait_for_log_entries(..., timeout)helper here would make the assertions much more stable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/frontend/test_request_tracing_logs.py` around lines 78 - 84, The current read_log_file helper returns a static snapshot and tests use fixed sleep delays, which leads to flaky timing; replace or augment read_log_file/usage with a polling helper (e.g., wait_for_log_entries(process, predicate, timeout)) that repeatedly reads process.log_path (via process.log_path and os.path.exists) in a short loop with small sleeps until the predicate on the log content succeeds or timeout elapses; implement the helper to return the log content or raise/return None on timeout and update tests to call wait_for_log_entries instead of sleeping so assertions wait deterministically and bounded by the timeout.
152-289: Collapse the tracing service bootstrap into one factory/parametrized fixture.
tracing_services*andtracing_services_disagg*repeat the same JSONL env setup, frontend startup, readiness wait, and yielded shape. That duplication makes future tracing/logging changes easy to fix in one path and miss in another.As per coding guidelines, "Do not copy-paste test infrastructure; reuse and refactor shared setup logic into fixtures or
tests/utils/, and use@pytest.mark.parametrizewith indirect fixtures instead of separate test functions."Also applies to: 682-733
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/frontend/test_request_tracing_logs.py` around lines 152 - 289, Consolidate the three fixtures (tracing_services, tracing_services_slow, tracing_services_disagg) into one parametrized factory fixture (e.g., tracing_services) that accepts modes like "normal", "slow", and "disagg" via pytest.mark.parametrize indirect; move the shared jsonl_env creation, DynamoFrontendProcess startup, wait_for_http_completions_ready(TEST_MODEL), and common yield shape into that single fixture, and branch on the mode to set JsonlMockerWorkerProcess args (speedup_ratio=0.1 for "slow"; extra_args and two workers + allocate_port/deallocate_port for "disagg" using allocate_port/deallocate_port and worker_id "prefill-worker"/"decode-worker"); ensure the fixture yields the same keys ("frontend_port","frontend","worker") for normal/slow and ("frontend_port","frontend","prefill_worker","decode_worker") for disagg so existing tests keep working.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/frontend/test_request_tracing_logs.py`:
- Line 587: Import threading at the top-level module imports instead of inside
individual test functions; remove the local "import threading" statements from
the three test functions (the ones around lines 587, 738, and 807) so the module
uses a single module-level "threading" import and avoid redundant local imports.
- Around line 473-494: The try/except block around the streaming request should
not swallow all exceptions; replace the broad "except Exception: pass" with
explicit handling for the expected streaming-disconnect errors (e.g.,
requests.exceptions.ChunkedEncodingError, requests.exceptions.ReadTimeout,
requests.exceptions.ConnectionError, BrokenPipeError/OSError) so unrelated
failures surface; locate the call to requests.post(...) and the loop using
resp.iter_lines() in test_request_tracing_logs.py and update the except clause
to only catch those specific exceptions and otherwise let errors propagate.
- Around line 36-42: pytestmark in tests/frontend/test_request_tracing_logs.py
currently lacks a timeout for long-running E2E operations; add a module-level
timeout marker (e.g., pytest.mark.timeout(180)) to the existing pytestmark list
so the test process/HTTP/polling loops cannot hang indefinitely; update the
pytestmark variable (the pytestmark list defined near the top of the file) to
include pytest.mark.timeout(...) alongside the existing markers.
---
Nitpick comments:
In `@tests/frontend/test_request_tracing_logs.py`:
- Around line 78-84: The current read_log_file helper returns a static snapshot
and tests use fixed sleep delays, which leads to flaky timing; replace or
augment read_log_file/usage with a polling helper (e.g.,
wait_for_log_entries(process, predicate, timeout)) that repeatedly reads
process.log_path (via process.log_path and os.path.exists) in a short loop with
small sleeps until the predicate on the log content succeeds or timeout elapses;
implement the helper to return the log content or raise/return None on timeout
and update tests to call wait_for_log_entries instead of sleeping so assertions
wait deterministically and bounded by the timeout.
- Around line 152-289: Consolidate the three fixtures (tracing_services,
tracing_services_slow, tracing_services_disagg) into one parametrized factory
fixture (e.g., tracing_services) that accepts modes like "normal", "slow", and
"disagg" via pytest.mark.parametrize indirect; move the shared jsonl_env
creation, DynamoFrontendProcess startup,
wait_for_http_completions_ready(TEST_MODEL), and common yield shape into that
single fixture, and branch on the mode to set JsonlMockerWorkerProcess args
(speedup_ratio=0.1 for "slow"; extra_args and two workers +
allocate_port/deallocate_port for "disagg" using allocate_port/deallocate_port
and worker_id "prefill-worker"/"decode-worker"); ensure the fixture yields the
same keys ("frontend_port","frontend","worker") for normal/slow and
("frontend_port","frontend","prefill_worker","decode_worker") for disagg so
existing tests keep working.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e8a7e482-e25b-4c2d-a474-31c23a8686f8
📒 Files selected for processing (1)
tests/frontend/test_request_tracing_logs.py
| pytestmark = [ | ||
| pytest.mark.e2e, | ||
| pytest.mark.gpu_0, | ||
| pytest.mark.post_merge, | ||
| pytest.mark.parallel, | ||
| pytest.mark.model(TEST_MODEL), | ||
| ] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get its size
wc -l tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 107
🏁 Script executed:
# Check the pytestmark definition and surrounding context
head -100 tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 3116
🏁 Script executed:
# Search for all pytest markers in the file, especially timeout
rg "pytest\.mark\." tests/frontend/test_request_tracing_logs.py -nRepository: ai-dynamo/dynamo
Length of output: 207
🏁 Script executed:
# Search for time.sleep, subprocess, and other long operations
rg "time\.sleep|subprocess|\.Popen|\.request\(" tests/frontend/test_request_tracing_logs.py -nRepository: ai-dynamo/dynamo
Length of output: 438
🏁 Script executed:
# Check individual test functions and their decorators/markers
rg "@pytest\.mark\.(timeout|pre_merge|post_merge|nightly)" tests/frontend/test_request_tracing_logs.py -B 5 -A 1 | head -150Repository: ai-dynamo/dynamo
Length of output: 42
🏁 Script executed:
# Get a list of all test function definitions
rg "^def test_" tests/frontend/test_request_tracing_logs.py -nRepository: ai-dynamo/dynamo
Length of output: 770
🏁 Script executed:
# Check context around subprocess/service setup to understand test scope and duration
sed -n '150,300p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 5200
🏁 Script executed:
# Check if any individual tests have timeout markers
grep -n "@pytest.mark.timeout" tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 42
🏁 Script executed:
# Get actual test durations or estimates from fixture setup and operations
sed -n '290,350p' tests/frontend/test_request_tracing_logs.py | head -80Repository: ai-dynamo/dynamo
Length of output: 2611
Add @pytest.mark.timeout(...) to cover E2E tests with subprocess and polling operations.
Module-level pytestmark includes scheduling/GPU/type markers but lacks a timeout marker. These tests spawn frontend/worker processes, make HTTP requests, and use time.sleep() for polling. Without timeouts, a wedged subprocess could hang the test suite indefinitely. Per guidelines, tests with long operations (network calls, subprocess waits, polling loops) must have explicit pytest.mark.timeout(...) set to 3× measured average (or conservatively 120–180 seconds for this E2E suite).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/frontend/test_request_tracing_logs.py` around lines 36 - 42, pytestmark
in tests/frontend/test_request_tracing_logs.py currently lacks a timeout for
long-running E2E operations; add a module-level timeout marker (e.g.,
pytest.mark.timeout(180)) to the existing pytestmark list so the test
process/HTTP/polling loops cannot hang indefinitely; update the pytestmark
variable (the pytestmark list defined near the top of the file) to include
pytest.mark.timeout(...) alongside the existing markers.
| try: | ||
| resp = requests.post( | ||
| f"http://localhost:{port}/v1/chat/completions", | ||
| headers={ | ||
| "Content-Type": "application/json", | ||
| "x-dynamo-request-id": rid, | ||
| }, | ||
| json={ | ||
| "model": TEST_MODEL, | ||
| "messages": [{"role": "user", "content": "Hello"}], | ||
| "max_tokens": 2000, | ||
| "stream": True, | ||
| }, | ||
| stream=True, # Don't download body eagerly | ||
| timeout=10, | ||
| ) | ||
| # Read just enough to confirm stream started, then close | ||
| for _ in resp.iter_lines(): | ||
| break # read one line then stop | ||
| resp.close() # Force TCP connection close | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check if file exists and read the specified lines
if [ -f "tests/frontend/test_request_tracing_logs.py" ]; then
echo "=== Lines 473-494 ==="
sed -n '473,494p' tests/frontend/test_request_tracing_logs.py
echo ""
echo "=== Lines 756-775 ==="
sed -n '756,775p' tests/frontend/test_request_tracing_logs.py
else
echo "File not found"
fiRepository: ai-dynamo/dynamo
Length of output: 1427
🏁 Script executed:
# Search for other crash-related tests or exception handling patterns in the file
echo "=== Searching for 'crash' or 'except' patterns ==="
grep -n "crash\|except.*Exception\|except.*requests\." tests/frontend/test_request_tracing_logs.py | head -30Repository: ai-dynamo/dynamo
Length of output: 1191
🏁 Script executed:
# Check overall file structure to understand context
wc -l tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 107
🏁 Script executed:
# Get function context for lines 473-494
echo "=== Context around lines 473-494 (function name) ==="
sed -n '450,500p' tests/frontend/test_request_tracing_logs.py | head -60
echo ""
echo "=== Context around lines 756-775 (function name) ==="
sed -n '730,780p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 3732
🏁 Script executed:
# Check what these test functions are named and their purpose
grep -B 5 "resp = requests.post" tests/frontend/test_request_tracing_logs.py | grep -E "^[0-9]+-def test_|^[0-9]+-.*except"Repository: ai-dynamo/dynamo
Length of output: 42
🏁 Script executed:
# Read the test_agg_worker_crash function to see what it expects
sed -n '581,630p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 1961
🏁 Script executed:
# Read test_disagg_decode_crash to see the exception pattern
sed -n '805,850p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 1662
Catch only the expected streaming failures here.
except Exception: pass will swallow unrelated request/setup regressions and still let the test continue as if the intended disconnect/crash path happened. Mirror the explicit exception handling already used in other crash tests (test_agg_worker_crash, test_disagg_decode_crash) and let unexpected failures fail fast.
Minimal tightening
- except Exception:
- pass
+ except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError):
+ pass🧰 Tools
🪛 Ruff (0.15.7)
[error] 493-494: try-except-pass detected, consider logging the exception
(S110)
[warning] 493-493: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/frontend/test_request_tracing_logs.py` around lines 473 - 494, The
try/except block around the streaming request should not swallow all exceptions;
replace the broad "except Exception: pass" with explicit handling for the
expected streaming-disconnect errors (e.g.,
requests.exceptions.ChunkedEncodingError, requests.exceptions.ReadTimeout,
requests.exceptions.ConnectionError, BrokenPipeError/OSError) so unrelated
failures surface; locate the call to requests.post(...) and the loop using
resp.iter_lines() in test_request_tracing_logs.py and update the except clause
to only catch those specific exceptions and otherwise let errors propagate.
| worker = tracing_services_slow["worker"] | ||
| rid = str(uuid.uuid4()) | ||
|
|
||
| import threading |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file structure and locate the threading imports
head -100 tests/frontend/test_request_tracing_logs.py | grep -n "import\|from" | head -20Repository: ai-dynamo/dynamo
Length of output: 612
🏁 Script executed:
# Now let's check the specific lines mentioned: 587, 738, 807
sed -n '585,590p' tests/frontend/test_request_tracing_logs.py
sed -n '736,741p' tests/frontend/test_request_tracing_logs.py
sed -n '805,810p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 796
🏁 Script executed:
# Check if threading is already imported at the module level
grep -n "^import threading\|^from threading" tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 42
🏁 Script executed:
# Let's also get context around those lines to understand which function they belong to
sed -n '580,595p' tests/frontend/test_request_tracing_logs.py
sed -n '730,745p' tests/frontend/test_request_tracing_logs.py
sed -n '800,815p' tests/frontend/test_request_tracing_logs.pyRepository: ai-dynamo/dynamo
Length of output: 1830
Move threading to the module import block.
threading is imported inside three test function bodies (lines 587, 738, 807) rather than at the module level. Per coding guidelines, all imports must be at the top of the file. Move it once to the module-level imports to avoid redundant local imports and maintain consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/frontend/test_request_tracing_logs.py` at line 587, Import threading at
the top-level module imports instead of inside individual test functions; remove
the local "import threading" statements from the three test functions (the ones
around lines 587, 738, and 807) so the module uses a single module-level
"threading" import and avoid redundant local imports.
ac29172 to
3ca1685
Compare
3ca1685 to
42c8201
Compare
f5e0f17 to
c898496
Compare
42c8201 to
0943ded
Compare
11 parallel-safe tests (25s with -n auto) covering: - Aggregated: unary success, streaming success, 404, 400 invalid UUID, cancellation, worker crash, request ID propagation - Disaggregated: streaming success, unary success, prefill crash, decode crash Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c898496 to
12b9d7a
Compare
Summary
11 parallel-safe e2e tests verifying JSONL log output for all request tracing scenarios. Tests run in 26s with
-n auto.Depends on: #7735
Tests
test_agg_unary_successtest_agg_streaming_successtest_agg_404_errortest_agg_400_invalid_uuidtest_agg_request_id_propagationtest_agg_cancellationtest_agg_worker_crashtest_disagg_streaming_successtest_disagg_unary_successtest_disagg_prefill_crashtest_disagg_decode_crashExample Log Output
Before (no structured request tracing)
No request_id, no model, no trace correlation, no token counts.
After — Streaming Success
{"level":"INFO","message":"request received","request_id":"284f18c7-...","model":"qwen/qwen3-0.6b","endpoint":"chat_completions","request_type":"stream"} {"level":"INFO","message":"http response sent","status":"200","latency_ms":"4"} {"level":"INFO","message":"request completed","status":"success","elapsed_ms":"14","input_tokens":"9","output_tokens":"50","ttft_ms":"5.85","avg_itl_ms":"0.29"}After — 404 Error
{"level":"INFO","message":"request received","request_id":"4644979b-...","model":"nonexistent-model","endpoint":"chat_completions"} {"level":"ERROR","message":"request completed","status":"error","error_type":"not_found","error_detail":"model or resource not found","elapsed_ms":"0"} {"level":"ERROR","message":"http response sent","status":"404","latency_ms":"0"}After — Worker Crash (decode)
{"level":"INFO","message":"request received","request_id":"1d555fa1-...","model":"qwen/qwen3-0.6b","request_type":"stream"} {"level":"ERROR","message":"request completed","status":"error","error_type":"internal","error_detail":"internal server error during processing","elapsed_ms":"556","input_tokens":"9","output_tokens":"4"}After — Cancellation
{"level":"INFO","message":"request received","request_id":"8677080b-...","request_type":"stream"} {"level":"ERROR","message":"request completed","status":"error","error_type":"cancelled","error_detail":"client disconnected before completion","elapsed_ms":"230"} {"level":"WARN","message":"Stream closed unexpectedly; issuing cancellation"}After — Worker (trace_id correlation)
{"level":"INFO","message":"request received","request_id":"6ec6ddd9-...","component":"backend","endpoint":"generate","instance_id":"2221573453717914121","trace_id":"3fe59d206530983e7fce5d405048f7c3"} {"level":"INFO","message":"request completed","request_id":"6ec6ddd9-...","trace_id":"3fe59d206530983e7fce5d405048f7c3"}🤖 Generated with Claude Code
Linear: DIS-1643
Summary by CodeRabbit