fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs#5890
Conversation
…h in-transit POSTs
Run 25639584797 surfaced Receiver_DrainAsync_WaitsForLatePostBeforeReturning
failing on macOS arm64. Local repro on Windows: 3% failure rate over 100
iterations of the existing test, ~270ms drain elapsed.
The race: a request that's been sent over TCP but not yet pulled by
HttpListener.GetContextAsync is invisible to both _inflightTasks and
TotalRequests. There's no hook between the kernel TCP queue and the accept
loop, so a single 250ms idle window can return while a POST is mid-flight.
The original test caught this with a wall-clock floor (>=350ms), but that
floor itself was fragile and not actually checking the contract.
Two changes:
- DrainAsync now requires two consecutive idle stable windows (~500ms
minimum) before returning. A POST in transit at the start of drain
has at least one full window to surface in TotalRequests / inflight.
- The test now asserts the actual contract — the late POST's
TaskCompletionSource was set before drain returned — instead of a
wall-clock floor. The wall-clock floor was a proxy that broke down
when scheduling jitter reordered the post relative to drain start.
Local: 100/100 passes after the fix.
There was a problem hiding this comment.
Code Review
The diagnosis is solid and the fix is well-reasoned. A few observations:
What works well
Test assertion improvement (OtlpReceiverIngestionTests.cs:222): Replacing drainElapsed >= 350ms with a TaskCompletionSource is a meaningful upgrade. The old assertion was measuring a proxy — it could pass even when the drain returned early if Task.Delay jitter happened to push the POST before drain start. The new assertion tests the actual invariant: the POST was fully acknowledged (response received + SetResult called) before drain declared quiet. That's strictly more correct.
Two-window logic (OtlpReceiver.cs:199): The TCP kernel queue → HttpListener.GetContextAsync gap is real and well-described. Requiring two consecutive idle windows is a pragmatic way to buy the in-transit request time to surface.
Concerns
1. Unconditional latency tax on truly-idle receivers
The two-window requirement now unconditionally doubles the minimum drain time (~250ms → ~500ms), even when no requests have ever arrived. Consider gating it on TotalRequests > 0:
// Only require two windows if at least one request was seen;
// a receiver that never got any traffic can exit after one idle window.
var requiredIdleWindows = Volatile.Read(ref _diagnostics.TotalRequests) > 0 ? 2 : 1;
...
if (++consecutiveIdleWindows >= requiredIdleWindows)
{
return;
}This avoids penalising test teardown paths where the receiver happened to receive nothing, while preserving the double-window guard for the cases that actually need it.
2. The heuristic is still raceable at the tail
Two windows of 250ms give ~500ms for an in-transit request to surface. On a heavily loaded CI machine, a request could theoretically be in the TCP queue for longer than 500ms. The PR description acknowledges this ("best-effort heuristic"), so it's not a blocker — but it's worth capturing in the DefaultDrainWindow docs that users with slow or batching exporters should raise TUNIT_OTLP_DRAIN_MS. The current XML doc already hints at this, so no change is strictly required.
3. consecutiveIdleWindows is implicitly coupled to stableFor
The comment says "~500ms" but that depends on stableFor = 250ms being stable. If someone later changes stableFor or makes it configurable, the "two windows" semantics silently changes. A minor defensive measure would be a named constant:
const int RequiredConsecutiveIdleWindows = 2;...so any future reader changing stableFor knows to revisit the window count.
4. Test ordering (minor readability)
await Assert.That(latePostCompleted.Task.IsCompleted).IsTrue();
await Assert.That(receiver.Diagnostics.TracesRequests).IsEqualTo(1);
await latePost; // cleanup after assertionsThe await latePost after the assertions is intentional (assert first, clean up second), but it reads as though the task might still be running. A one-liner comment like // latePost already completed; await to propagate any exception would clarify the intent.
Summary
The core fix is correct and the test improvement is a clear win. The concerns above are mostly about robustness at the edges: (1) is the most actionable and avoids a small but real latency regression on idle receivers. The others are low-priority polish. Happy to approve once (1) is considered.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Summary
Receiver_DrainAsync_WaitsForLatePostBeforeReturningfailed on macOS arm64 in run 25639584797. Local repro on Windows showed a real bug: drain could return after just one ~250 ms idle window while a POST was still in transit between the kernel TCP queue andHttpListener.GetContextAsync— invisible to both_inflightTasksandTotalRequests.Two changes
Receiver:
DrainAsyncnow requires two consecutive idle stable windows (~500 ms minimum) before returning. A POST that's mid-flight at the start of drain has at least one full window to surface in the diagnostics / in-flight tracking.Test: asserts the actual contract directly — the late POST's
TaskCompletionSourceis observably completed by the time drain returns — instead of a wall-clock floor. The previous>=350 msfloor was a fragile proxy that broke down when scheduling jitter reordered the late POST relative to drain start.Test plan
OtlpReceiverIngestionTestspass..NETworkflow runs across all three OS legs.