fix(ws): install rustls crypto provider for WSS/TLS support#5
Merged
pamungkaski merged 9 commits intomainfrom Jan 21, 2026
Merged
fix(ws): install rustls crypto provider for WSS/TLS support#5pamungkaski merged 9 commits intomainfrom
pamungkaski merged 9 commits intomainfrom
Conversation
Fixes critical panic when connecting to WSS (secure WebSocket) endpoints. Rustls 0.23+ requires explicit crypto provider initialization before any TLS operations. Changes: - Add rustls dependency with aws-lc-rs crypto provider to Cargo.toml - Install crypto provider at startup in main() before async operations - Add BDD integration tests for WSS connections (3 scenarios) - Add graceful step definitions that warn on external endpoint failures - Update DIARY.md with fix details The crypto provider is installed once globally and works for both HTTP (reqwest) and WebSocket (tokio-tungstenite) TLS connections. Tests: - All 85 unit tests pass - Cucumber test harnesses compile successfully - New WSS integration tests with @wss @external tags Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds configuration file and documentation for testing WSS connections with public Holesky endpoints. Changes: - Add config.wss-test.toml with public Holesky WSS endpoints - Primary: publicnode-holesky (wss://ethereum-holesky-rpc.publicnode.com) - Backup: drpc-holesky (demo key) - Update wss_connection.feature with instructions - Document WSS tests in INTEGRATION_TESTS.md Usage: cargo run --release -- --config config.wss-test.toml cargo test --test integration_cucumber -- --tags @wss Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Integrates WSS integration tests into the main `just integration-test` command, running them after Kurtosis tests complete. Also adds WSS tests to GitHub CI with non-critical failure handling. Changes: - Update justfile: WSS tests now run as part of `integration-test` - Runs Kurtosis tests first (must pass) - Then runs WSS tests (failures are non-critical) - Pretty output with test summary - Add GitHub CI job for WSS tests with `continue-on-error: true` - Allows CI to pass even if public endpoints are unavailable - Provides clear warning messages about external dependencies - Update README.md with integrated workflow documentation - Update INTEGRATION_TESTS.md with command descriptions Workflow: just integration-test ├── Kurtosis Integration Tests (critical) └── WSS Integration Tests (non-critical, may fail) WSS test failures exit with code 0 to not break the workflow, with clear warnings that failures are expected due to external dependencies. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes 'Vixy URL not set' error in WSS integration tests by ensuring the background step initializes world.vixy_url before other steps run. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates WSS test config to use ethereum-hoodi-rpc.publicnode.com which provides free public access without API keys. Changes: - Use ethereum-hoodi-rpc.publicnode.com for both primary and backup - Verified endpoint works: eth_blockNumber returns valid response - No API key required (unlike drpc.org demo key which expired) - Updated docs to reflect Holesky endpoint usage Tests now use same Holesky chain for both nodes (not mixing chains). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates all references from 'Holesky' to 'Hoodi' to use the correct network name. Changes: - config.wss-test.toml: node names and comments - tests/features/integration/wss_connection.feature: background step - tests/steps/integration_steps.rs: step definition name and messages - INTEGRATION_TESTS.md: documentation - README.md: integration test description - justfile: WSS test comments - DIARY.md: challenge description The network is called Hoodi, not Holesky. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes issue where GitHub CI was running ALL integration tests instead of just WSS tests. The --tags argument wasn't being properly handled by the cucumber harness. Changes: - Add VIXY_WSS_ONLY environment variable to filter tests by @wss tag - Update integration_cucumber.rs to filter scenarios when VIXY_WSS_ONLY=1 - Update GitHub CI to use VIXY_WSS_ONLY=1 instead of --tags @wss - Update justfile to use VIXY_WSS_ONLY=1 for both test-wss targets - Update documentation to show correct usage Now WSS tests run in isolation without attempting Kurtosis operations. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Pin Kurtosis ethereum-package to v6.0.0 to avoid force_update parameter error - Fix cucumber test filtering by properly chaining builder methods - Ensures integration tests run successfully with proper test isolation Tests: All integration tests passing (20 Kurtosis scenarios + 3 WSS scenarios)
merklefruit
approved these changes
Jan 21, 2026
Comment on lines
+37
to
+41
| // Install TLS crypto provider for WebSocket connections (WSS) | ||
| // This must be done before any TLS operations | ||
| rustls::crypto::aws_lc_rs::default_provider() | ||
| .install_default() | ||
| .map_err(|_| eyre::eyre!("Failed to install rustls crypto provider"))?; |
pamungkaski
added a commit
that referenced
this pull request
Jan 23, 2026
…rovements Add comprehensive documentation for production incident investigation and fix plan: **Root Cause Analysis:** - Issue #2 (CRITICAL): Subscription replay responses break clients - Issue #5 (CRITICAL): Never switches back to primary after recovery - Issue #1: Messages lost during reconnection window - Issue #3: Health checks block without timeout **Documentation Added:** 1. WEBSOCKET-RECONNECTION-FIX.md - Complete root cause analysis (5 issues) - Design principles violated - 3-phase fix plan (P0/P1/P2) - Full TDD implementations with code - Rollout strategy and success metrics 2. TESTING-IMPROVEMENTS.md - Analysis of 8 critical test gaps - Why existing tests missed production issues - 4-phase test improvement plan - Complete test implementations (integration, property-based, chaos, load) - CI/CD integration strategy 3. docs/PRODUCTION-INCIDENT-2026-01-23.md - Incident summary and timeline - Quick reference guide - Links to detailed documentation **Key Findings:** - Tests focused on happy paths, missed edge cases - No tests for regular requests after reconnection - No tests for multiple simultaneous subscriptions - No load/chaos testing to catch concurrency issues **Next Steps:** - Phase 0 (24h): Fix Issues #2 and #5 (critical hotfix) - Phase 1 (1wk): Add critical integration tests - Then implement remaining fixes and test improvements Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski
added a commit
that referenced
this pull request
Jan 23, 2026
#2 & #5) This commit implements the Phase 0 critical hotfix for the production incident where WebSocket clients experienced "context deadline exceeded" errors after node reconnection. Root Cause Analysis: - Issue #2: Subscription replay responses were forwarded to clients, breaking their JSON-RPC state machines - Issue #5: Health monitor never switched back to primary after recovery, leaving traffic stuck on backup nodes indefinitely Production Impact: - Timeline: 4 reconnections over 3 hours - Subscriptions dropped from 4-5 to 1 (40% broken) - Half the clients disconnected and reconnected fresh - Metrics showed backup connected 3h after primary recovered Fix #1 - Issue #2: Subscription Replay (src/proxy/ws.rs:673-727) -------------------------------------------------------- Updated reconnect_upstream() to track replayed subscription requests in pending_subscribes before sending them to the new upstream. This ensures the subscription response is consumed internally via handle_upstream_message() instead of being forwarded to the client. Changes: - Added pending_subscribes parameter to reconnect_upstream signature - Insert replayed subscription RPC IDs into pending_subscribes before send - Update call site in run_proxy_loop to pass pending_subscribes - Subscription responses now consumed, client JSON-RPC state preserved Before: Replay response forwarded → client breaks → zombie connection After: Replay response consumed → transparent reconnection Fix #2 - Issue #5: Primary Failback (src/proxy/ws.rs:152-205) ------------------------------------------------------- Updated health_monitor() to check for better (primary) nodes, not just whether the current node is unhealthy. Now automatically switches back to primary when it recovers. Changes: - Check select_healthy_node() every cycle - Reconnect if best_name != current_name (handles both unhealthy and better) - Log reason as "current_unhealthy" or "better_available" - Simplified logic to avoid nested ifs (clippy clean) Before: Only reconnects when current node fails → stuck on backup forever After: Always uses best available node → auto-rebalance to primary Testing: ------- - Added 3 critical integration test scenarios (el_proxy.feature:67-107): 1. Regular JSON-RPC requests work after reconnection 2. Multiple subscriptions maintained after reconnection 3. WebSocket switches back to primary when recovered - All 88 unit tests pass - Clippy clean (fixed format string warning) TDD Workflow: ------------ 1. RED: Added test scenarios (unimplemented) 2. GREEN: Implemented fixes 3. REFACTOR: Cleaned up code, fixed clippy 4. VERIFY: All tests pass Expected Impact: --------------- - Zero "context deadline exceeded" errors after reconnection - Auto-switch to primary within 2 seconds of recovery - Transparent reconnection (clients see no internal operations) - No zombie connections Next Steps: ---------- 1. Implement integration test step definitions 2. Run Kurtosis integration tests 3. Deploy to staging for 24h soak test 4. Canary rollout: 10% → 25% → 50% → 100% Related Documentation: --------------------- - WEBSOCKET-RECONNECTION-FIX.md - Complete fix plan - TESTING-IMPROVEMENTS.md - Test gap analysis - docs/PRODUCTION-INCIDENT-2026-01-23.md - Incident summary - DIARY.md - Updated with implementation details Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski
added a commit
that referenced
this pull request
Jan 23, 2026
…#5) Added comprehensive step definitions for the 3 critical WebSocket reconnection test scenarios: Scenario 1: Regular JSON-RPC requests work after reconnection (Issue #2) - Verifies eth_blockNumber requests continue working after reconnection - Checks that NO subscription replay responses are forwarded to client - Validates response time is reasonable Scenario 2: Multiple subscriptions maintained after reconnection (Issue #2) - Tests multiple simultaneous subscriptions (newHeads + newPendingTransactions) - Verifies subscriptions with specific RPC IDs (100, 101) - Ensures regular requests (RPC ID 200) don't interfere - Confirms no replay responses leak to client Scenario 3: WebSocket switches back to primary when recovered (Issue #5) - Validates metrics show primary node connected initially - Triggers failover to backup when primary stops - Verifies auto-switch back to primary after recovery - Ensures WebSocket connection remains stable throughout Step definitions added (20 new steps): - send_eth_block_number_and_receive - wait_for_reconnection - send_eth_block_number_ws - should_not_receive_replay_responses - response_time_less_than - subscribe_with_rpc_id - receive_confirmation_for_both - both_subscriptions_active - receive_notifications_for_both - send_eth_block_number_with_id - receive_block_number_response_with_id - should_not_receive_replay_with_ids - metrics_show_primary_connected - wait_for_failover_to_backup - metrics_should_show_backup - metrics_should_show_primary - websocket_should_still_work - receive_notifications_without_interruption These step definitions enable full integration testing of the Phase 0 fixes. Some steps use graceful degradation for Kurtosis environments where full block production may not be available. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski
added a commit
that referenced
this pull request
Jan 23, 2026
Removed issue numbering from inline comments in: - src/proxy/ws.rs:171 - Changed 'Issue #5 fix' to descriptive comment - tests/steps/integration_steps.rs:1345 - Changed 'Issue #2 and #5' to descriptive header All Issue# references now removed from code files (only remain in documentation). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes critical panic when connecting to WSS (secure WebSocket) endpoints. Rustls 0.23+ requires explicit crypto provider initialization before any TLS operations.
Changes:
The crypto provider is installed once globally and works for both HTTP (reqwest) and WebSocket (tokio-tungstenite) TLS connections.
Tests: