fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs by thomhurst · Pull Request #5890 · thomhurst/TUnit

thomhurst · 2026-05-11T11:02:03Z

Summary

Receiver_DrainAsync_WaitsForLatePostBeforeReturning failed on macOS arm64 in run 25639584797. Local repro on Windows showed a real bug: drain could return after just one ~250 ms idle window while a POST was still in transit between the kernel TCP queue and HttpListener.GetContextAsync — invisible to both _inflightTasks and TotalRequests.

Two changes

Receiver: DrainAsync now requires two consecutive idle stable windows (~500 ms minimum) before returning. A POST that's mid-flight at the start of drain has at least one full window to surface in the diagnostics / in-flight tracking.

Test: asserts the actual contract directly — the late POST's TaskCompletionSource is observably completed by the time drain returns — instead of a wall-clock floor. The previous >=350 ms floor was a fragile proxy that broke down when scheduling jitter reordered the late POST relative to drain start.

Test plan

Local Windows: 100/100 passes after the fix (was 97/100 before).
All 7 OtlpReceiverIngestionTests pass.
.NET workflow runs across all three OS legs.

…h in-transit POSTs Run 25639584797 surfaced Receiver_DrainAsync_WaitsForLatePostBeforeReturning failing on macOS arm64. Local repro on Windows: 3% failure rate over 100 iterations of the existing test, ~270ms drain elapsed. The race: a request that's been sent over TCP but not yet pulled by HttpListener.GetContextAsync is invisible to both _inflightTasks and TotalRequests. There's no hook between the kernel TCP queue and the accept loop, so a single 250ms idle window can return while a POST is mid-flight. The original test caught this with a wall-clock floor (>=350ms), but that floor itself was fragile and not actually checking the contract. Two changes: - DrainAsync now requires two consecutive idle stable windows (~500ms minimum) before returning. A POST in transit at the start of drain has at least one full window to surface in TotalRequests / inflight. - The test now asserts the actual contract — the late POST's TaskCompletionSource was set before drain returned — instead of a wall-clock floor. The wall-clock floor was a proxy that broke down when scheduling jitter reordered the post relative to drain start. Local: 100/100 passes after the fix.

claude

Code Review

The diagnosis is solid and the fix is well-reasoned. A few observations:

What works well

Test assertion improvement (OtlpReceiverIngestionTests.cs:222): Replacing drainElapsed >= 350ms with a TaskCompletionSource is a meaningful upgrade. The old assertion was measuring a proxy — it could pass even when the drain returned early if Task.Delay jitter happened to push the POST before drain start. The new assertion tests the actual invariant: the POST was fully acknowledged (response received + SetResult called) before drain declared quiet. That's strictly more correct.

Two-window logic (OtlpReceiver.cs:199): The TCP kernel queue → HttpListener.GetContextAsync gap is real and well-described. Requiring two consecutive idle windows is a pragmatic way to buy the in-transit request time to surface.

Concerns

1. Unconditional latency tax on truly-idle receivers

The two-window requirement now unconditionally doubles the minimum drain time (~250ms → ~500ms), even when no requests have ever arrived. Consider gating it on TotalRequests > 0:

// Only require two windows if at least one request was seen;
// a receiver that never got any traffic can exit after one idle window.
var requiredIdleWindows = Volatile.Read(ref _diagnostics.TotalRequests) > 0 ? 2 : 1;
...
if (++consecutiveIdleWindows >= requiredIdleWindows)
{
    return;
}

This avoids penalising test teardown paths where the receiver happened to receive nothing, while preserving the double-window guard for the cases that actually need it.

2. The heuristic is still raceable at the tail

Two windows of 250ms give ~500ms for an in-transit request to surface. On a heavily loaded CI machine, a request could theoretically be in the TCP queue for longer than 500ms. The PR description acknowledges this ("best-effort heuristic"), so it's not a blocker — but it's worth capturing in the DefaultDrainWindow docs that users with slow or batching exporters should raise TUNIT_OTLP_DRAIN_MS. The current XML doc already hints at this, so no change is strictly required.

3. consecutiveIdleWindows is implicitly coupled to stableFor

The comment says "~500ms" but that depends on stableFor = 250ms being stable. If someone later changes stableFor or makes it configurable, the "two windows" semantics silently changes. A minor defensive measure would be a named constant:

const int RequiredConsecutiveIdleWindows = 2;

...so any future reader changing stableFor knows to revisit the window count.

4. Test ordering (minor readability)

await Assert.That(latePostCompleted.Task.IsCompleted).IsTrue();
await Assert.That(receiver.Diagnostics.TracesRequests).IsEqualTo(1);

await latePost;  // cleanup after assertions

The await latePost after the assertions is intentional (assert first, clean up second), but it reads as though the task might still be running. A one-liner comment like // latePost already completed; await to propagate any exception would clarify the intent.

Summary

The core fix is correct and the test improvement is a clear win. The concerns above are mostly about robustness at the edges: (1) is the most actionable and avoids a small but real latency regression on idle receivers. The others are low-priority polish. Happy to approve once (1) is considered.

codacy-production · 2026-05-11T11:04:10Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results

Complexity 0

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

thomhurst temporarily deployed to Pull Requests May 11, 2026 11:02 — with GitHub Actions Inactive

thomhurst had a problem deploying to Pull Requests May 11, 2026 11:02 — with GitHub Actions Failure

claude Bot reviewed May 11, 2026

View reviewed changes

thomhurst merged commit 131eac2 into main May 11, 2026
13 of 14 checks passed

thomhurst deleted the fix/ci-otlp-drain-race branch May 11, 2026 22:08

dependabot Bot mentioned this pull request May 14, 2026

Bump the nuget group with 13 updates metacoder-feff/FEFF.TestFixtures#49

Merged

claude Bot mentioned this pull request May 14, 2026

chore(deps): update tunit to 1.44.39 #5929

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs#5890

fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs#5890
thomhurst merged 1 commit into
mainfrom
fix/ci-otlp-drain-race

thomhurst commented May 11, 2026

Uh oh!

claude Bot left a comment

Uh oh!

codacy-production Bot commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

thomhurst commented May 11, 2026

Summary

Two changes

Test plan

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Code Review

What works well

Concerns

Summary

Uh oh!

codacy-production Bot commented May 11, 2026

Up to standards ✅

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant