Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute #2766

danielmarbach · 2025-03-21T11:39:42Z

This PR improves the reliability of receive count tracking in the Amazon SQS transport by combining the existing local LRU cache with the service-provided ApproximateReceiveCount.
Rather than replacing the local cache or relying solely on the potentially unreliable service count, we now take the maximum of the two values.

Rationale

The current implementation tracks receive counts per message using an endpoint-level LRU cache. However, this cache is:
- Unreliable in scale-out scenarios (multiple endpoint instances don't share the cache).
- Unreliable on endpoint restarts (the cache is wiped).
The AWS-provided ApproximateReceiveCount is also not always accurate and can under-report in certain cases. See:
- https://zaccharles.medium.com/i-actually-thought-you-were-going-to-say-approximatereceivecount-when-i-asked-how-youd-keep-track-282e8f51af17
By taking Max(LRU Cache, ApproximateReceiveCount):
- We hedge against both under-reporting by AWS and cache loss during endpoint restarts.
- We improve reliability without blindly trusting either source.
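
The combination described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the transport's actual code: `ReceiveCountCache`, `effective_receive_count`, and the capacity value are hypothetical names and numbers chosen for illustration, and the sketch assumes the local count is incremented on every receive.

```python
from collections import OrderedDict


class ReceiveCountCache:
    """Tiny LRU cache of per-message receive counts (hypothetical stand-in)."""

    def __init__(self, capacity=1000):
        self._capacity = capacity
        self._counts = OrderedDict()

    def increment(self, message_id):
        # Remove and re-insert so the entry becomes the most recently used one.
        count = self._counts.pop(message_id, 0) + 1
        self._counts[message_id] = count
        if len(self._counts) > self._capacity:
            self._counts.popitem(last=False)  # evict the least recently used entry
        return count


def effective_receive_count(cache, message_id, approximate_receive_count):
    """Take the higher of the local count and the service-provided count."""
    local_count = cache.increment(message_id)
    return max(local_count, approximate_receive_count)


# The service under-reports (stays at 1) while the local cache keeps counting.
cache = ReceiveCountCache()
first = effective_receive_count(cache, "msg-1", 1)   # local 1, service 1
second = effective_receive_count(cache, "msg-1", 1)  # local 2, service 1

# After a simulated restart the cache is empty, so the service count wins.
restarted_cache = ReceiveCountCache()
after_restart = effective_receive_count(restarted_cache, "msg-1", 3)  # local 1, service 3
```

In this sketch neither source is trusted on its own: whichever count is higher for a given receive is the one that is used.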

Edge Case Considered

An unusual edge case was considered:

- A message is received once and tracked locally.
- Then it's processed via another competing consumer.
- The service-provided receive count increases, but the local cache remains lower.

Using the max value ensures we reflect the higher count and prevent regressions.
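
To make the edge case concrete, here is a small simulation in Python. It is a sketch under stated assumptions, not the transport's code: `FakeQueue` and `Consumer` are hypothetical stand-ins, the fake queue counts receives exactly (the real ApproximateReceiveCount may not), and each consumer keeps its own unshared local counts.

```python
class FakeQueue:
    """Hypothetical queue that counts every receive of a message."""

    def __init__(self):
        self.receive_counts = {}

    def receive(self, message_id):
        count = self.receive_counts.get(message_id, 0) + 1
        self.receive_counts[message_id] = count
        return count  # plays the role of ApproximateReceiveCount


class Consumer:
    """Hypothetical competing consumer with its own local receive counts."""

    def __init__(self, queue):
        self.queue = queue
        self.local_counts = {}

    def receive(self, message_id):
        approximate = self.queue.receive(message_id)
        local = self.local_counts.get(message_id, 0) + 1
        self.local_counts[message_id] = local
        return local, approximate, max(local, approximate)


queue = FakeQueue()
consumer_a = Consumer(queue)
consumer_b = Consumer(queue)

first = consumer_a.receive("msg-1")   # (local, service, max) = (1, 1, 1)
second = consumer_b.receive("msg-1")  # (1, 2, 2): B has never seen the message locally
third = consumer_a.receive("msg-1")   # (2, 3, 3): A's local count lags behind the service
```

On the second and third receives the local count alone would be too low, and taking the max picks up the higher service-provided value.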

Trade-offs

✅ Improved accuracy without a major rework.
✅ Safe fallback behavior in edge cases.
⚠️ Still requires maintaining the LRU cache (storage pressure, eviction policies, GC implications), but those design constraints already exist today.

Additional context

Discussed with @mauroservienti, and we concluded that adding a test might just lead to another flaky test. An acceptance test would have to start an endpoint, handle the message once, stop the scenario, run another scenario, and then try to assert that the number of retries is an exact match. It is challenging to abort an acceptance test scenario in that exact manner.

I have also looked at a transport test, but unfortunately the transport test infrastructure is so rigid that it doesn't really allow you to properly restart pumps and wipe out the previous state. It also has built-in assumptions around extracting the stack frame on StartPump: StartPump fails if it is not the first await statement in a test, and restarting the pump would require calling StartPump multiple times.

This PR also includes a change to the MockClients that makes the underlying collection types concurrency safe, because we got flaky tests due to the same list elements being overwritten concurrently.
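
As a rough analogue of that kind of mock-client change, the Python sketch below guards a recorded-requests list with a lock so concurrent callers cannot interleave writes. It is only an illustration of the idea: `RecordingMockClient`, `send`, and `hammer` are hypothetical names, and the actual MockClients in this PR are not shown here.

```python
import threading


class RecordingMockClient:
    """Hypothetical mock client that records requests in a thread-safe way."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests = []

    def send(self, request):
        # Only one thread at a time may mutate the underlying list.
        with self._lock:
            self._requests.append(request)

    @property
    def requests(self):
        # Return a snapshot so callers never iterate a list that is being mutated.
        with self._lock:
            return list(self._requests)


def hammer(client, worker_id, request_count):
    """Send a fixed number of uniquely identifiable requests."""
    for index in range(request_count):
        client.send((worker_id, index))


client = RecordingMockClient()
threads = [
    threading.Thread(target=hammer, args=(client, worker_id, 100))
    for worker_id in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After all threads finish, every request sent by every worker is present exactly once in `client.requests`.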

danielmarbach · 2025-03-27T06:40:27Z

Pulling this in for now because it will also address some of the flakiness on the other PR.

danielmarbach self-assigned this Mar 21, 2025

danielmarbach mentioned this pull request Mar 21, 2025

Consider using "ApproximateReceiveCount" attribute to compute number of immediate processing failures for ErrorContext #2163

Closed

danielmarbach requested review from lailabougria and mauroservienti March 21, 2025 11:45

danielmarbach changed the title ~~Spike combining the receive count with the local count~~ Use "ApproximateReceiveCount" attribute to compute the number of immediate processing failures for ErrorContext Mar 24, 2025

Spike combining the receive count with the local count

292b86f

danielmarbach force-pushed the receive-count branch from cfcc7b4 to 5ffc459 Compare March 24, 2025 09:09

danielmarbach marked this pull request as ready for review March 24, 2025 09:15

danielmarbach added 2 commits March 24, 2025 21:29

Make the mock clients concurrency safe on the collections because apparently something is yielding somewhere

5ead91c

Adjust the index access to the new collection type

c22f87c

danielmarbach force-pushed the receive-count branch from 5ffc459 to c22f87c Compare March 24, 2025 14:30

danielmarbach added the Improvement label Mar 24, 2025

danielmarbach added this to the 7.3.0 milestone Mar 24, 2025

lailabougria approved these changes Mar 27, 2025

danielmarbach merged commit 7f846cd into master Mar 27, 2025
4 checks passed

danielmarbach deleted the receive-count branch March 27, 2025 06:40

danielmarbach changed the title ~~Use "ApproximateReceiveCount" attribute to compute the number of immediate processing failures for ErrorContext~~ Improve the accuracy of immediate processing failure counting in scaled-out scenarios by utilizing the ApproximateReceiveCount attribute Apr 14, 2025

danielmarbach mentioned this pull request Apr 15, 2025

TransportTest infrastructure doesn't allow restarting pumps Particular/NServiceBus#7330

Open
