feat: implement Phase 2 failure detection with heartbeat sampling and hint caps#52
feat: implement Phase 2 failure detection with heartbeat sampling and hint caps#52
Conversation
… hint caps - Add heartbeat peer sampling with configurable size (WithDistHeartbeatSample) - Implement node state transition metrics (suspect/dead counters) - Add global hint queue caps by count (WithDistHintMaxTotal) and bytes (WithDistHintMaxBytes) - Track membership version for cluster state changes - Expose membership snapshot API with state distribution - Add comprehensive test coverage for failure recovery, hint caps, and sampling - Update documentation to reflect Phase 2 completion status - Refactor hint replay logic for better maintainability - Add approximate byte accounting for queued hints with new metrics This completes the experimental failure detection system outlined in the roadmap Phase 2, providing better scalability through sampling and resource protection via global hint limits.
|
Running Code Quality on PRs by uploading data to Trunk will soon be removed. You can still run checks on your PRs using trunk-action - see the migration guide for more information. |
There was a problem hiding this comment.
Pull Request Overview
This PR implements Phase 2 of the experimental failure detection system, adding heartbeat peer sampling, node state transition tracking, and global hint queue limits for better scalability and resource protection.
- Implements configurable heartbeat sampling to probe random peers instead of all peers each tick
- Adds global hint queue caps by count and bytes with metrics for dropped hints due to limits
- Tracks membership version changes and exposes node state transitions (suspect/dead counters)
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/backend/dist_memory.go | Core implementation of heartbeat sampling, global hint caps, and state transition metrics |
| internal/cluster/membership.go | Add membership versioning for cluster state change tracking |
| internal/cluster/version.go | New atomic version tracker for membership changes |
| tests/ | Comprehensive test coverage for failure recovery, hint caps, and heartbeat sampling |
| ROADMAP.md | Update Phase 2 status to completed |
| README.md | Document new features and configuration options |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later | ||
| _ = dm // acknowledge receiver intentionally (satisfy lint under current rule set) |
There was a problem hiding this comment.
[nitpick] The comment and unused receiver acknowledgment is unnecessary. Consider making this a standalone function or remove the receiver if it's not needed for symmetry.
| func (dm *DistMemory) approxHintSize(item *cache.Item) int64 { // receiver retained for symmetry; may use config later | |
| _ = dm // acknowledge receiver intentionally (satisfy lint under current rule set) | |
| func approxHintSize(item *cache.Item) int64 { |
| // Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any). | ||
| package tests | ||
|
|
||
| // (File intentionally left empty after consolidation of helpers.) |
There was a problem hiding this comment.
This empty file with a comment about consolidation suggests incomplete cleanup. Consider removing this file entirely if it's no longer needed.
| // Package tests provides shared test helpers (duplicate directory retained to appease earlier imports if any). | |
| package tests | |
| // (File intentionally left empty after consolidation of helpers.) |
| if ver < 3 { // initial upserts already increment version; tolerate timing variance | ||
| t.Fatalf("expected membership version >=4, got %v", verAny) |
There was a problem hiding this comment.
[nitpick] The magic number 3 in the version check lacks clear justification. Consider calculating the expected minimum version based on the number of operations or using a named constant.
| if ver < 3 { // initial upserts already increment version; tolerate timing variance | |
| t.Fatalf("expected membership version >=4, got %v", verAny) | |
| if ver < initialUpserts { // initial upserts already increment version; tolerate timing variance | |
| t.Fatalf("expected membership version >=%d, got %v", initialUpserts, verAny) |
This completes the experimental failure detection system outlined in the roadmap Phase 2, providing better scalability through sampling and resource protection via global hint limits.