Skip to content

Fix poison pill bug: Make retry heap unbounded#2012

Closed
the-mann wants to merge 2 commits intosender-block-on-failurefrom
fix-poison-pill-unbounded-heap
Closed

Fix poison pill bug: Make retry heap unbounded#2012
the-mann wants to merge 2 commits intosender-block-on-failurefrom
fix-poison-pill-unbounded-heap

Conversation

@the-mann
Copy link
Contributor

Summary

Fix deadlock when failing log groups exceed concurrency limit by making the retry heap unbounded.

Problem

With bounded retry heap (size = concurrency):

  • Retry heap size = concurrency (e.g., 2)
  • When failing log groups (10) > heap size (2)
  • Heap fills with failed batches
  • Workers block trying to push more failed batches
  • System deadlocks - no progress possible
  • Allowed log groups get starved

Solution

Make retry heap unbounded:

  • Remove maxSize constraint and semaphore
  • Push() is now non-blocking
  • Failed batches queue up without blocking workers
  • Allowed log groups continue publishing normally

Changes

  • Remove maxSize and semaphore from retryHeap struct
  • Make Push() non-blocking (no semaphore wait)
  • Remove semaphore release from PopReady()
  • Update NewRetryHeap() to ignore maxSize parameter (kept for API compatibility)
  • Update TestRetryHeap_SemaphoreBlockingAndUnblockingTestRetryHeap_UnboundedPush
  • Update TestRetryHeapSmallerThanFailingLogGroups to validate fix

Test Results

Before Fix: Test deadlocked after 30s with all goroutines blocked

After Fix: ✅ Test PASSES

Allowed success=5 (all batches published)
Denied attempts=110
Heap size=28 (grew beyond concurrency limit of 2)
No deadlock or starvation

Related

- TestPoisonPillScenario: Validates continuous batch generation with 10 denied + 1 allowed log group
- TestSingleDeniedLogGroup: Baseline test with 1 denied + 1 allowed log group
- TestRetryHeapSmallerThanFailingLogGroups: Demonstrates deadlock when heap size < failing log groups (SKIPPED)

The third test intentionally deadlocks to prove the bug exists when:
- Retry heap size = concurrency (2)
- Number of failing log groups (10) > heap size (2)
- Workers block trying to push to full heap
- System deadlocks, starving allowed log group
Remove max size constraint from retry heap to prevent deadlock when
failing log groups exceed concurrency limit.

Changes:
- Remove maxSize and semaphore from retryHeap struct
- Make Push() non-blocking (no semaphore wait)
- Remove semaphore release from PopReady()
- Update NewRetryHeap() to ignore maxSize parameter (kept for API compatibility)
- Update TestRetryHeap_SemaphoreBlockingAndUnblocking -> TestRetryHeap_UnboundedPush
- Update TestRetryHeapSmallerThanFailingLogGroups to validate fix

Before: With concurrency=2 and 10 failing log groups, retry heap (size=2)
would fill up, causing workers to block on Push(), leading to deadlock.

After: Retry heap is unbounded, allowing all failed batches to be queued
without blocking workers. Allowed log groups continue publishing normally.

Test results:
- TestRetryHeapSmallerThanFailingLogGroups: PASS (5/5 allowed batches published)
- Heap grew to size 28 (beyond concurrency limit of 2)
- No deadlock or starvation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant