feat(ws): Add health-aware WebSocket reconnection with subscription replay by pamungkaski · Pull Request #3 · chainbound/vixy

pamungkaski · 2026-01-15T10:13:33Z

No description provided.

…eplay Implement automatic WebSocket reconnection when upstream EL nodes become unhealthy. The proxy now monitors node health and seamlessly switches to healthy nodes while preserving client subscriptions. Key features: - SubscriptionTracker to track active eth_subscribe requests - Health monitor task that polls node health every second - Automatic reconnection with subscription replay on new upstream - Subscription ID translation to preserve client-facing IDs - Comprehensive metrics for connections, reconnections, and subscriptions New metrics added: - ws_connections_active/total: Track active and lifetime connections - ws_messages_total: Count messages by direction (upstream/downstream) - ws_reconnections_total: Successful reconnection count - ws_reconnection_attempts_total: Attempts by status (success/failed) - ws_subscriptions_active/total: Track active subscriptions - ws_upstream_node: Current upstream node indicator Integration tests added for WebSocket failover scenarios. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add metrics recording in the health monitor to ensure metrics are populated from startup. This fixes the Prometheus metrics integration test which was failing because no metrics were being recorded. Metrics now recorded in monitor: - EL/CL chain head values - Per-node block/slot numbers, lag, and health status - Healthy node counts for EL and CL - Failover status and counter Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Collapse nested if statements in WebSocket proxy code to satisfy clippy's collapsible_if lint. Uses Rust's let-chain syntax to combine conditions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The el_requests_total, el_request_duration_seconds, cl_requests_total, and cl_request_duration_seconds metrics were defined but never called. Now both el_proxy_handler and cl_proxy_handler record: - Request count (incremented on each request) - Request duration (histogram observation) EL requests also include the tier label (primary/backup). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

merklefruit · 2026-01-15T13:50:50Z

DIARY.md

+- Subscription ID translation is essential for seamless client experience
+- Rust's type system (clippy) encourages clean abstractions via type aliases
+
+**Mood:** Accomplished - this was a significant feature addition with real-world value!


Claude feels accomplished, nice!

we are closer to AGI

merklefruit

lgtm

…rovements Add comprehensive documentation for production incident investigation and fix plan: **Root Cause Analysis:** - Issue #2 (CRITICAL): Subscription replay responses break clients - Issue #5 (CRITICAL): Never switches back to primary after recovery - Issue #1: Messages lost during reconnection window - Issue #3: Health checks block without timeout **Documentation Added:** 1. WEBSOCKET-RECONNECTION-FIX.md - Complete root cause analysis (5 issues) - Design principles violated - 3-phase fix plan (P0/P1/P2) - Full TDD implementations with code - Rollout strategy and success metrics 2. TESTING-IMPROVEMENTS.md - Analysis of 8 critical test gaps - Why existing tests missed production issues - 4-phase test improvement plan - Complete test implementations (integration, property-based, chaos, load) - CI/CD integration strategy 3. docs/PRODUCTION-INCIDENT-2026-01-23.md - Incident summary and timeline - Quick reference guide - Links to detailed documentation **Key Findings:** - Tests focused on happy paths, missed edge cases - No tests for regular requests after reconnection - No tests for multiple simultaneous subscriptions - No load/chaos testing to catch concurrency issues **Next Steps:** - Phase 0 (24h): Fix Issues #2 and #5 (critical hotfix) - Phase 1 (1wk): Add critical integration tests - Then implement remaining fixes and test improvements Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implemented remaining fixes from production incident timeline to eliminate lock starvation, improve health check performance, and prevent message loss. Issue #3 (Phase 1): Health check timeouts - Added 5-second timeouts to all HTTP clients in health checking - Prevents health checks from blocking indefinitely when nodes are unresponsive - Affected files: src/health/el.rs, src/health/cl.rs Issue #4 (Phase 2): Concurrent health checks - Refactored monitor to not hold write locks during I/O operations - Pattern: read lock → concurrent checks → write lock for updates - Uses futures_util::future::join_all for parallel execution - Eliminates lock contention between health monitor and WebSocket monitor - Reduces health check cycle time from O(n) to O(1) - Affected files: src/monitor.rs Issue #1 (Phase 1): Message queueing during reconnection - Added VecDeque message queue and AtomicBool reconnecting flag - Messages from clients queued during reconnection instead of being dropped - Queue replayed after successful reconnection - Prevents message loss during 2-5 second reconnection window - Affected files: src/proxy/ws.rs Testing: - All 88 unit tests pass - Clippy clean (no warnings) - Integration tests: 26 scenarios (25 passed, 1 skipped), 160 steps passed - Kurtosis: 23 scenarios, 144 steps - WSS: 3 scenarios, 16 steps Impact: - Eliminates lock starvation during health checks - Improves system responsiveness under load - Zero message loss during reconnection - Health checks timeout properly (5s max) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Replaced external issue/finding references with self-explanatory comments that describe what the code does and why, without requiring access to session documentation. Changes: - Health checks: "Use timeout to prevent blocking" instead of "Issue #3 Fix" - Monitor: "Don't hold write lock during I/O to prevent lock contention" - WebSocket reconnection: Clear explanations of queueing, background tasks, replay - Subscription handling: "Replayed response from reconnection" instead of "Issue #2" Benefits: - Comments are self-contained and understandable without external context - Future developers don't need access to issue tracking - Code is more maintainable and easier to understand - Explains WHY not just reference ticket numbers All tests pass: 92/92 unit tests ✅ Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

pamungkaski and others added 4 commits January 15, 2026 16:39

style(ws): Fix clippy collapsible_if warnings

0eac1ef

Collapse nested if statements in WebSocket proxy code to satisfy clippy's collapsible_if lint. Uses Rust's let-chain syntax to combine conditions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

merklefruit reviewed Jan 15, 2026

View reviewed changes

merklefruit approved these changes Jan 15, 2026

View reviewed changes

pamungkaski merged commit 1275127 into main Jan 15, 2026
5 checks passed

merklefruit deleted the ki/fix/ws-reconnection-and-metrics branch January 16, 2026 10:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ws): Add health-aware WebSocket reconnection with subscription replay#3

feat(ws): Add health-aware WebSocket reconnection with subscription replay#3
pamungkaski merged 4 commits intomainfrom
ki/fix/ws-reconnection-and-metrics

pamungkaski commented Jan 15, 2026

Uh oh!

merklefruit Jan 15, 2026

Uh oh!

pamungkaski Jan 15, 2026

Uh oh!

merklefruit left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pamungkaski commented Jan 15, 2026

Uh oh!

merklefruit Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

pamungkaski Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

merklefruit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants