Skip to content

state: timeout + capacity eviction for PendingBuffer (#40)#157

Merged
intendednull merged 1 commit into
mainfrom
fix/issue-40
Apr 19, 2026
Merged

state: timeout + capacity eviction for PendingBuffer (#40)#157
intendednull merged 1 commit into
mainfrom
fix/issue-40

Conversation

@intendednull
Copy link
Copy Markdown
Owner

Summary

  • Adds age-based + capacity-based eviction to PendingBuffer so events stuck waiting for a predecessor that will never arrive (partition, offline peer) are dropped instead of accumulating forever.
  • Each pending entry now carries an optional insertion timestamp; buffer_for_prev_at(prev, event, now_ms) evicts expired entries first, then enforces the cap. The legacy buffer_for_prev path is retained for callers that can't supply a clock (capacity-only eviction).
  • ReplayRole wires its per-server buffer to the timestamped path using SystemTime::now() and surfaces the total pending count via a new pending_count field on WorkerRoleInfo::Replay so operators can monitor backpressure.
  • Defaults: DEFAULT_PENDING_MAX_ENTRIES = 10_000, DEFAULT_PENDING_MAX_AGE_MS = 3_600_000 (1h). Configurable via ReplayConfig::pending_max_entries / pending_max_age_ms.

Test plan

  • cargo fmt --check
  • cargo clippy -p willow-state --all-targets -- -D warnings
  • cargo clippy -p willow-replay --all-targets -- -D warnings
  • cargo clippy -p willow-common --all-targets -- -D warnings
  • cargo test -p willow-state — 184 passed (includes 7 new tests)
  • cargo test -p willow-replay — 30 passed (includes 2 new tests)
  • cargo test -p willow-worker — 13 passed (unchanged)
  • cargo check --target wasm32-unknown-unknown -p willow-state

New coverage:

  • Age eviction drops stale entries when a later insert advances the clock past max_age_ms.
  • Capacity eviction drops the oldest entry when max_entries + 1 events are buffered.
  • Timestamp-less (legacy) entries are immune to age eviction but still subject to capacity.
  • pending_count() reflects both eviction policies accurately.
  • WorkerRoleInfo::Replay::pending_count is exposed to heartbeat/monitoring consumers.

Closes #40

PendingBuffer now drops stale pending events when their predecessor never
arrives (partition, permanent offline peer). Each entry carries an optional
insertion timestamp; inserts that supply a wall-clock forward through
`buffer_for_prev_at` also sweep expired entries first, then enforce the
capacity cap. Legacy `buffer_for_prev` is retained (capacity-only) so
existing callers keep working.

Defaults: 10_000 max entries, 1h max age (constants exposed as
`DEFAULT_PENDING_MAX_ENTRIES` / `DEFAULT_PENDING_MAX_AGE_MS`).

ReplayRole wires its per-server buffer to `buffer_for_prev_at` using
`SystemTime::now()` and surfaces `pending_count` via a new field on
`WorkerRoleInfo::Replay` so operators can monitor backpressure.

Each eviction logs at `warn!` with the event hash and (for age eviction)
the age in ms.

Tests at the state and replay tiers cover:
- age eviction after `max_age_ms` advances
- capacity eviction when `max_entries + 1` is exceeded
- timestamp-less entries immune to age eviction
- `pending_count()` reflects both eviction policies
- `WorkerRoleInfo::Replay::pending_count` exposed correctly

Closes #40

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intendednull intendednull merged commit f58a6e0 into main Apr 19, 2026
4 checks passed
intendednull added a commit that referenced this pull request Apr 19, 2026
Merge race between #151 and #157: #157 added `pending_count` to
`WorkerRoleInfo::Replay`, then #151's test fixture for the
WorkerAnnouncement sad-path case landed without the new field and
broke `cargo test --workspace` on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add timeout-based eviction for pending events in ReplayRole

1 participant