Skip to content

Conversation

@danielmarbach
Copy link
Contributor

@danielmarbach danielmarbach commented Mar 21, 2025

Closes #2163

This PR improves the reliability of receive count tracking in the Amazon SQS transport by combining the existing local LRU cache with the service-provided ApproximateReceiveCount.
Rather than replacing the local cache or relying solely on the potentially unreliable service count, we now take the maximum of the two values.

Rationale

Edge Case Considered

An unusual edge case was considered:

  • A message is received once and tracked locally.
  • Then it's processed via another competing consumer.
  • The service-provided receive count increases, but the local cache remains lower.
  • Using the max value ensures we reflect the higher count and prevent regressions.

Trade-offs

  • Improved accuracy without a major rework.
  • Safe fallback behavior in edge cases.
  • ⚠️ Still requires maintaining the LRU cache (storage pressure, eviction policies, GC implications) but those are already existing design restrictions today.

Additional context

Discussed with @mauroservienti and we concluded, adding a test might just lead to another flaky test. An acceptance test would have to start an endpoint, handle the message once, stop the scenario, run another scenario and then try to assert that the number of retries is an exact match. It is challenging to abort an acceptance test scenarios in that exact manner.

I have also looked at a transport test, but unfortunately the transport test infrastructure is so rigid that it doesn't really allow you to properly restart pumps and wipe out the previous state. It also has built in assumptions around extracting the stack frame on the StartPump. StartPump fails if it is not the first await statement in a test, and restarting the pump would require to call StartPump multiple times.

This PR also includes a change to the MockClients that changes the underlying collection types to be concurrency safe because we got flaky tests due to the same list elements being overwritten concurrently

@danielmarbach danielmarbach self-assigned this Mar 21, 2025
@danielmarbach danielmarbach changed the title Spike combining the receive count with the local count Use "ApproximateReceiveCount" attribute to compute the number of immediate processing failures for ErrorContext Mar 24, 2025
@danielmarbach danielmarbach marked this pull request as ready for review March 24, 2025 09:15
@danielmarbach
Copy link
Contributor Author

Pulling this in for now because it also will address some of the flakyness on the other PR

@danielmarbach danielmarbach merged commit 7f846cd into master Mar 27, 2025
4 checks passed
@danielmarbach danielmarbach deleted the receive-count branch March 27, 2025 06:40
@danielmarbach danielmarbach changed the title Use "ApproximateReceiveCount" attribute to compute the number of immediate processing failures for ErrorContext Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute Apr 14, 2025
@danielmarbach danielmarbach changed the title Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider using "ApproximateReceiveCount" attribute to compute number of immediate processing failures for ErrorContext

3 participants