Skip to content

fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs#5890

Merged
thomhurst merged 1 commit into
mainfrom
fix/ci-otlp-drain-race
May 11, 2026
Merged

fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs#5890
thomhurst merged 1 commit into
mainfrom
fix/ci-otlp-drain-race

Conversation

@thomhurst
Copy link
Copy Markdown
Owner

Summary

Receiver_DrainAsync_WaitsForLatePostBeforeReturning failed on macOS arm64 in run 25639584797. Local repro on Windows showed a real bug: drain could return after just one ~250 ms idle window while a POST was still in transit between the kernel TCP queue and HttpListener.GetContextAsync — invisible to both _inflightTasks and TotalRequests.

Two changes

Receiver: DrainAsync now requires two consecutive idle stable windows (~500 ms minimum) before returning. A POST that's mid-flight at the start of drain has at least one full window to surface in the diagnostics / in-flight tracking.

Test: asserts the actual contract directly — the late POST's TaskCompletionSource is observably completed by the time drain returns — instead of a wall-clock floor. The previous >=350 ms floor was a fragile proxy that broke down when scheduling jitter reordered the late POST relative to drain start.

Test plan

  • Local Windows: 100/100 passes after the fix (was 97/100 before).
  • All 7 OtlpReceiverIngestionTests pass.
  • .NET workflow runs across all three OS legs.

…h in-transit POSTs

Run 25639584797 surfaced Receiver_DrainAsync_WaitsForLatePostBeforeReturning
failing on macOS arm64. Local repro on Windows: 3% failure rate over 100
iterations of the existing test, ~270ms drain elapsed.

The race: a request that's been sent over TCP but not yet pulled by
HttpListener.GetContextAsync is invisible to both _inflightTasks and
TotalRequests. There's no hook between the kernel TCP queue and the accept
loop, so a single 250ms idle window can return while a POST is mid-flight.
The original test caught this with a wall-clock floor (>=350ms), but that
floor itself was fragile and not actually checking the contract.

Two changes:

  - DrainAsync now requires two consecutive idle stable windows (~500ms
    minimum) before returning. A POST in transit at the start of drain
    has at least one full window to surface in TotalRequests / inflight.

  - The test now asserts the actual contract — the late POST's
    TaskCompletionSource was set before drain returned — instead of a
    wall-clock floor. The wall-clock floor was a proxy that broke down
    when scheduling jitter reordered the post relative to drain start.

Local: 100/100 passes after the fix.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The diagnosis is solid and the fix is well-reasoned. A few observations:

What works well

Test assertion improvement (OtlpReceiverIngestionTests.cs:222): Replacing drainElapsed >= 350ms with a TaskCompletionSource is a meaningful upgrade. The old assertion was measuring a proxy — it could pass even when the drain returned early if Task.Delay jitter happened to push the POST before drain start. The new assertion tests the actual invariant: the POST was fully acknowledged (response received + SetResult called) before drain declared quiet. That's strictly more correct.

Two-window logic (OtlpReceiver.cs:199): The TCP kernel queue → HttpListener.GetContextAsync gap is real and well-described. Requiring two consecutive idle windows is a pragmatic way to buy the in-transit request time to surface.


Concerns

1. Unconditional latency tax on truly-idle receivers

The two-window requirement now unconditionally doubles the minimum drain time (~250ms → ~500ms), even when no requests have ever arrived. Consider gating it on TotalRequests > 0:

// Only require two windows if at least one request was seen;
// a receiver that never got any traffic can exit after one idle window.
var requiredIdleWindows = Volatile.Read(ref _diagnostics.TotalRequests) > 0 ? 2 : 1;
...
if (++consecutiveIdleWindows >= requiredIdleWindows)
{
    return;
}

This avoids penalising test teardown paths where the receiver happened to receive nothing, while preserving the double-window guard for the cases that actually need it.

2. The heuristic is still raceable at the tail

Two windows of 250ms give ~500ms for an in-transit request to surface. On a heavily loaded CI machine, a request could theoretically be in the TCP queue for longer than 500ms. The PR description acknowledges this ("best-effort heuristic"), so it's not a blocker — but it's worth capturing in the DefaultDrainWindow docs that users with slow or batching exporters should raise TUNIT_OTLP_DRAIN_MS. The current XML doc already hints at this, so no change is strictly required.

3. consecutiveIdleWindows is implicitly coupled to stableFor

The comment says "~500ms" but that depends on stableFor = 250ms being stable. If someone later changes stableFor or makes it configurable, the "two windows" semantics silently changes. A minor defensive measure would be a named constant:

const int RequiredConsecutiveIdleWindows = 2;

...so any future reader changing stableFor knows to revisit the window count.

4. Test ordering (minor readability)

await Assert.That(latePostCompleted.Task.IsCompleted).IsTrue();
await Assert.That(receiver.Diagnostics.TracesRequests).IsEqualTo(1);

await latePost;  // cleanup after assertions

The await latePost after the assertions is intentional (assert first, clean up second), but it reads as though the task might still be running. A one-liner comment like // latePost already completed; await to propagate any exception would clarify the intent.


Summary

The core fix is correct and the test improvement is a clear win. The concerns above are mostly about robustness at the edges: (1) is the most actionable and avoids a small but real latency regression on idle receivers. The others are low-priority polish. Happy to approve once (1) is considered.

@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@thomhurst thomhurst merged commit 131eac2 into main May 11, 2026
13 of 14 checks passed
@thomhurst thomhurst deleted the fix/ci-otlp-drain-race branch May 11, 2026 22:08
@claude claude Bot mentioned this pull request May 14, 2026
1 task
This was referenced May 14, 2026
This was referenced May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant