Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute #2766
Closes #2163
This PR improves the reliability of receive count tracking in the Amazon SQS transport by combining the existing local LRU cache with the service-provided ApproximateReceiveCount. Rather than replacing the local cache or relying solely on the potentially unreliable service count, we now take the maximum of the two values.
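As a rough illustration of the "maximum of the two values" idea, here is a minimal Python sketch. It is not the transport's actual code: `ReceiveCountCache` and `effective_receive_count` are hypothetical names, and the sketch assumes the message attributes arrive as a string-to-string mapping in which a missing ApproximateReceiveCount is treated as 0.

```python
from collections import OrderedDict


class ReceiveCountCache:
    """Hypothetical in-memory LRU cache of locally observed receive counts."""

    def __init__(self, max_entries=1000):
        self._max_entries = max_entries
        self._counts = OrderedDict()

    def increment(self, message_id):
        # Remove and re-insert so the entry becomes the most recently used.
        count = self._counts.pop(message_id, 0) + 1
        self._counts[message_id] = count
        if len(self._counts) > self._max_entries:
            # Evict the least recently used entry.
            self._counts.popitem(last=False)
        return count


def effective_receive_count(cache, message_id, attributes):
    """Return the larger of the local count and the service-reported count."""
    local = cache.increment(message_id)
    # Assumption: attribute values are strings; a missing attribute counts as 0.
    reported = int(attributes.get("ApproximateReceiveCount", "0"))
    return max(local, reported)
```

In a scaled-out scenario, a freshly started instance has an empty local cache, so the service-reported value dominates. If the service under-reports, the local count dominates instead.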
Rationale
ApproximateReceiveCount is also not always accurate, and can under-report in certain cases. See:
Max(LRU Cache, ApproximateReceiveCount):
Edge Case Considered
An unusual edge case was considered:
Trade-offs
Additional context
Discussed with @mauroservienti, and we concluded that adding a test might just lead to another flaky test. An acceptance test would have to start an endpoint, handle the message once, stop the scenario, run another scenario, and then try to assert that the number of retries is an exact match. It is challenging to abort an acceptance test scenario in that exact manner.
I have also looked at a transport test, but unfortunately the transport test infrastructure is so rigid that it doesn't really allow you to properly restart pumps and wipe out the previous state. It also has built-in assumptions around extracting the stack frame on the StartPump. StartPump fails if it is not the first await statement in a test, and restarting the pump would require calling StartPump multiple times.
This PR also includes a change to the MockClients that makes the underlying collection types concurrency safe, because we got flaky tests due to the same list elements being overwritten concurrently.