fix(ws): install rustls crypto provider for WSS/TLS support #5

Merged

pamungkaski merged 9 commits into main from ki/fix/tls-on-ws on Jan 21, 2026
Conversation

@pamungkaski (Collaborator)

Fixes a critical panic when connecting to WSS (secure WebSocket) endpoints. rustls 0.23+ requires explicit crypto provider initialization before any TLS operations.

Changes:

  • Add rustls dependency with aws-lc-rs crypto provider to Cargo.toml
  • Install crypto provider at startup in main() before async operations
  • Add BDD integration tests for WSS connections (3 scenarios)
  • Add graceful step definitions that warn on external endpoint failures
  • Update DIARY.md with fix details

The crypto provider is installed once globally and works for both HTTP (reqwest) and WebSocket (tokio-tungstenite) TLS connections.
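For reference, a minimal sketch of the startup order this describes (the shape of main() is an assumption; the exact install lines appear verbatim in the review comment below):

```rust
// Sketch only: the surrounding main() layout is assumed, not taken from the repo.
#[tokio::main]
async fn main() -> eyre::Result<()> {
    // Must run before reqwest or tokio-tungstenite perform any TLS handshake;
    // install_default() errors if a provider was already installed.
    rustls::crypto::aws_lc_rs::default_provider()
        .install_default()
        .map_err(|_| eyre::eyre!("Failed to install rustls crypto provider"))?;

    // ... load config and start the proxy ...
    Ok(())
}
```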

Tests:

  • All 85 unit tests pass
  • Cucumber test harnesses compile successfully
  • New WSS integration tests with @wss @external tags

pamungkaski and others added 9 commits on January 21, 2026 at 19:33
Fixes critical panic when connecting to WSS (secure WebSocket) endpoints.
Rustls 0.23+ requires explicit crypto provider initialization before any
TLS operations.

Changes:
- Add rustls dependency with aws-lc-rs crypto provider to Cargo.toml
- Install crypto provider at startup in main() before async operations
- Add BDD integration tests for WSS connections (3 scenarios)
- Add graceful step definitions that warn on external endpoint failures
- Update DIARY.md with fix details

The crypto provider is installed once globally and works for both HTTP
(reqwest) and WebSocket (tokio-tungstenite) TLS connections.

Tests:
- All 85 unit tests pass
- Cucumber test harnesses compile successfully
- New WSS integration tests with @wss @external tags

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds configuration file and documentation for testing WSS connections
with public Holesky endpoints.

Changes:
- Add config.wss-test.toml with public Holesky WSS endpoints
  - Primary: publicnode-holesky (wss://ethereum-holesky-rpc.publicnode.com)
  - Backup: drpc-holesky (demo key)
- Update wss_connection.feature with instructions
- Document WSS tests in INTEGRATION_TESTS.md

Usage:
  cargo run --release -- --config config.wss-test.toml
  cargo test --test integration_cucumber -- --tags @wss
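For illustration only, a sketch of what config.wss-test.toml might contain; the table and key names ([[nodes]], name, url) are hypothetical, not taken from the repository:

```toml
# Hypothetical layout -- the real key names in config.wss-test.toml may differ.
[[nodes]]
name = "publicnode-holesky"  # primary
url = "wss://ethereum-holesky-rpc.publicnode.com"

[[nodes]]
name = "drpc-holesky"        # backup (demo-key endpoint, elided here)
url = "wss://..."
```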

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Integrates WSS integration tests into the main `just integration-test`
command, running them after Kurtosis tests complete. Also adds WSS tests
to GitHub CI with non-critical failure handling.

Changes:
- Update justfile: WSS tests now run as part of `integration-test`
  - Runs Kurtosis tests first (must pass)
  - Then runs WSS tests (failures are non-critical)
  - Pretty output with test summary
- Add GitHub CI job for WSS tests with `continue-on-error: true`
  - Allows CI to pass even if public endpoints are unavailable
  - Provides clear warning messages about external dependencies
- Update README.md with integrated workflow documentation
- Update INTEGRATION_TESTS.md with command descriptions

Workflow:
  just integration-test
  ├── Kurtosis Integration Tests (critical)
  └── WSS Integration Tests (non-critical, may fail)

WSS test failures exit with code 0 to not break the workflow, with
clear warnings that failures are expected due to external dependencies.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes 'Vixy URL not set' error in WSS integration tests by ensuring
the background step initializes world.vixy_url before other steps run.
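A minimal sketch of such a background step, assuming a cucumber-rs World with a vixy_url field (the step text, type name, and URL are illustrative):

```rust
use cucumber::{given, World};

// Hypothetical World shape -- the real one lives in tests/steps/integration_steps.rs.
#[derive(Debug, Default, World)]
struct IntegrationWorld {
    vixy_url: Option<String>,
}

#[given("Vixy is running with the WSS test config")]
async fn vixy_is_running(world: &mut IntegrationWorld) {
    // Initialize vixy_url before any later step reads it; the missing
    // initialization here is what produced the 'Vixy URL not set' error.
    world.vixy_url = Some("ws://127.0.0.1:8545".to_string());
}
```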

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates WSS test config to use ethereum-hoodi-rpc.publicnode.com which
provides free public access without API keys.

Changes:
- Use ethereum-hoodi-rpc.publicnode.com for both primary and backup
- Verified endpoint works: eth_blockNumber returns valid response
- No API key required (unlike drpc.org demo key which expired)
- Updated docs to reflect Hoodi endpoint usage

Tests now use the same Hoodi chain for both nodes (not mixing chains).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates all references from 'Holesky' to 'Hoodi' to use the correct
network name.

Changes:
- config.wss-test.toml: node names and comments
- tests/features/integration/wss_connection.feature: background step
- tests/steps/integration_steps.rs: step definition name and messages
- INTEGRATION_TESTS.md: documentation
- README.md: integration test description
- justfile: WSS test comments
- DIARY.md: challenge description

The network is called Hoodi, not Holesky.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes issue where GitHub CI was running ALL integration tests instead of
just WSS tests. The --tags argument wasn't being properly handled by the
cucumber harness.

Changes:
- Add VIXY_WSS_ONLY environment variable to filter tests by @wss tag
- Update integration_cucumber.rs to filter scenarios when VIXY_WSS_ONLY=1
- Update GitHub CI to use VIXY_WSS_ONLY=1 instead of --tags @wss
- Update justfile to use VIXY_WSS_ONLY=1 for both test-wss targets
- Update documentation to show correct usage

Now WSS tests run in isolation without attempting Kurtosis operations.
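A sketch of how the harness can honor VIXY_WSS_ONLY with cucumber's filter_run; the World type and feature path are placeholders:

```rust
use cucumber::World;

#[derive(Debug, Default, World)]
struct IntegrationWorld; // stand-in for the real test World

#[tokio::main]
async fn main() {
    let wss_only = std::env::var("VIXY_WSS_ONLY")
        .map(|v| v == "1")
        .unwrap_or(false);
    IntegrationWorld::cucumber()
        .filter_run("tests/features/integration", move |_feature, _rule, scenario| {
            // With VIXY_WSS_ONLY=1, keep only scenarios tagged @wss;
            // otherwise run everything, Kurtosis scenarios included.
            !wss_only || scenario.tags.iter().any(|t| t == "wss")
        })
        .await;
}
```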

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Pin Kurtosis ethereum-package to v6.0.0 to avoid force_update parameter error
- Fix cucumber test filtering by properly chaining builder methods
- Ensures integration tests run successfully with proper test isolation

Tests: All integration tests passing (20 Kurtosis scenarios + 3 WSS scenarios)
@merklefruit (Contributor) left a comment:

great!

Comment on lines +37 to +41:

    // Install TLS crypto provider for WebSocket connections (WSS)
    // This must be done before any TLS operations
    rustls::crypto::aws_lc_rs::default_provider()
        .install_default()
        .map_err(|_| eyre::eyre!("Failed to install rustls crypto provider"))?;
Contributor:

LGTM

@pamungkaski pamungkaski merged commit 6c2890b into main Jan 21, 2026
6 checks passed
@merklefruit merklefruit deleted the ki/fix/tls-on-ws branch January 22, 2026 08:48
pamungkaski added a commit that referenced this pull request Jan 23, 2026
…rovements

Add comprehensive documentation for production incident investigation and fix plan:

**Root Cause Analysis:**
- Issue #2 (CRITICAL): Subscription replay responses break clients
- Issue #5 (CRITICAL): Never switches back to primary after recovery
- Issue #1: Messages lost during reconnection window
- Issue #3: Health checks block without timeout

**Documentation Added:**

1. WEBSOCKET-RECONNECTION-FIX.md
   - Complete root cause analysis (5 issues)
   - Design principles violated
   - 3-phase fix plan (P0/P1/P2)
   - Full TDD implementations with code
   - Rollout strategy and success metrics

2. TESTING-IMPROVEMENTS.md
   - Analysis of 8 critical test gaps
   - Why existing tests missed production issues
   - 4-phase test improvement plan
   - Complete test implementations (integration, property-based, chaos, load)
   - CI/CD integration strategy

3. docs/PRODUCTION-INCIDENT-2026-01-23.md
   - Incident summary and timeline
   - Quick reference guide
   - Links to detailed documentation

**Key Findings:**
- Tests focused on happy paths, missed edge cases
- No tests for regular requests after reconnection
- No tests for multiple simultaneous subscriptions
- No load/chaos testing to catch concurrency issues

**Next Steps:**
- Phase 0 (24h): Fix Issues #2 and #5 (critical hotfix)
- Phase 1 (1wk): Add critical integration tests
- Then implement remaining fixes and test improvements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski added a commit that referenced this pull request Jan 23, 2026
#2 & #5)

This commit implements the Phase 0 critical hotfix for the production incident
where WebSocket clients experienced "context deadline exceeded" errors after
node reconnection.

Root Cause Analysis:
- Issue #2: Subscription replay responses were forwarded to clients, breaking
  their JSON-RPC state machines
- Issue #5: Health monitor never switched back to primary after recovery,
  leaving traffic stuck on backup nodes indefinitely

Production Impact:
- Timeline: 4 reconnections over 3 hours
- Subscriptions dropped from 4-5 to 1 (40% broken)
- Half the clients disconnected and reconnected fresh
- Metrics showed backup connected 3h after primary recovered

Fix #1 - Issue #2: Subscription Replay (src/proxy/ws.rs:673-727)
--------------------------------------------------------
Updated reconnect_upstream() to track replayed subscription requests in
pending_subscribes before sending them to the new upstream. This ensures
the subscription response is consumed internally via handle_upstream_message()
instead of being forwarded to the client.

Changes:
- Added pending_subscribes parameter to reconnect_upstream signature
- Insert replayed subscription RPC IDs into pending_subscribes before send
- Update call site in run_proxy_loop to pass pending_subscribes
- Subscription responses now consumed, client JSON-RPC state preserved

Before: Replay response forwarded → client breaks → zombie connection
After: Replay response consumed → transparent reconnection
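A simplified sketch of that bookkeeping, with a hypothetical helper name and a generic sink in place of the real src/proxy/ws.rs types:

```rust
use std::collections::HashSet;

use futures::{Sink, SinkExt};

// Illustrative only: the real reconnect_upstream() works with
// tokio-tungstenite streams and the proxy's own state types.
async fn replay_subscriptions<S>(
    upstream: &mut S,
    saved_requests: &[serde_json::Value],
    pending_subscribes: &mut HashSet<u64>,
) -> Result<(), S::Error>
where
    S: Sink<String> + Unpin,
{
    for req in saved_requests {
        // Record the RPC id BEFORE sending, so the reply is matched in
        // handle_upstream_message() and consumed internally rather than
        // forwarded to the client.
        if let Some(id) = req.get("id").and_then(|v| v.as_u64()) {
            pending_subscribes.insert(id);
        }
        upstream.send(req.to_string()).await?;
    }
    Ok(())
}
```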

Fix #2 - Issue #5: Primary Failback (src/proxy/ws.rs:152-205)
-------------------------------------------------------
Updated health_monitor() to check for better (primary) nodes, not just
whether the current node is unhealthy. Now automatically switches back
to primary when it recovers.

Changes:
- Check select_healthy_node() every cycle
- Reconnect if best_name != current_name (handles both unhealthy and better)
- Log reason as "current_unhealthy" or "better_available"
- Simplified logic to avoid nested ifs (clippy clean)

Before: Only reconnects when current node fails → stuck on backup forever
After: Always uses best available node → auto-rebalance to primary
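A condensed sketch of that loop; NodeSet and its methods are hypothetical stand-ins for the proxy's real node registry:

```rust
use std::time::Duration;

// Hypothetical stand-in for the proxy's node registry.
struct NodeSet;
impl NodeSet {
    fn select_healthy_node(&self) -> Option<String> { None }
    fn is_healthy(&self, _name: &str) -> bool { true }
}

async fn health_monitor(nodes: NodeSet, mut current: String) {
    loop {
        tokio::time::sleep(Duration::from_secs(2)).await;
        // Always pick the best node, not merely "is the current one dead?";
        // this is the change that restores primary after it recovers.
        if let Some(best) = nodes.select_healthy_node() {
            if best != current {
                let reason = if nodes.is_healthy(&current) {
                    "better_available"   // primary recovered: fail back
                } else {
                    "current_unhealthy"  // classic failover
                };
                eprintln!("switching upstream to {best} ({reason})");
                // ... trigger reconnect_upstream() toward `best` ...
                current = best;
            }
        }
    }
}
```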

Testing:
-------
- Added 3 critical integration test scenarios (el_proxy.feature:67-107):
  1. Regular JSON-RPC requests work after reconnection
  2. Multiple subscriptions maintained after reconnection
  3. WebSocket switches back to primary when recovered
- All 88 unit tests pass
- Clippy clean (fixed format string warning)

TDD Workflow:
------------
1. RED: Added test scenarios (unimplemented)
2. GREEN: Implemented fixes
3. REFACTOR: Cleaned up code, fixed clippy
4. VERIFY: All tests pass

Expected Impact:
---------------
- Zero "context deadline exceeded" errors after reconnection
- Auto-switch to primary within 2 seconds of recovery
- Transparent reconnection (clients see no internal operations)
- No zombie connections

Next Steps:
----------
1. Implement integration test step definitions
2. Run Kurtosis integration tests
3. Deploy to staging for 24h soak test
4. Canary rollout: 10% → 25% → 50% → 100%

Related Documentation:
---------------------
- WEBSOCKET-RECONNECTION-FIX.md - Complete fix plan
- TESTING-IMPROVEMENTS.md - Test gap analysis
- docs/PRODUCTION-INCIDENT-2026-01-23.md - Incident summary
- DIARY.md - Updated with implementation details

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski added a commit that referenced this pull request Jan 23, 2026
…#5)

Added comprehensive step definitions for the 3 critical WebSocket reconnection test scenarios:

Scenario 1: Regular JSON-RPC requests work after reconnection (Issue #2)
- Verifies eth_blockNumber requests continue working after reconnection
- Checks that NO subscription replay responses are forwarded to client
- Validates response time is reasonable

Scenario 2: Multiple subscriptions maintained after reconnection (Issue #2)
- Tests multiple simultaneous subscriptions (newHeads + newPendingTransactions)
- Verifies subscriptions with specific RPC IDs (100, 101)
- Ensures regular requests (RPC ID 200) don't interfere
- Confirms no replay responses leak to client

Scenario 3: WebSocket switches back to primary when recovered (Issue #5)
- Validates metrics show primary node connected initially
- Triggers failover to backup when primary stops
- Verifies auto-switch back to primary after recovery
- Ensures WebSocket connection remains stable throughout

Step definitions added (20 new steps):
- send_eth_block_number_and_receive
- wait_for_reconnection
- send_eth_block_number_ws
- should_not_receive_replay_responses
- response_time_less_than
- subscribe_with_rpc_id
- receive_confirmation_for_both
- both_subscriptions_active
- receive_notifications_for_both
- send_eth_block_number_with_id
- receive_block_number_response_with_id
- should_not_receive_replay_with_ids
- metrics_show_primary_connected
- wait_for_failover_to_backup
- metrics_should_show_backup
- metrics_should_show_primary
- websocket_should_still_work
- receive_notifications_without_interruption

These step definitions enable full integration testing of the Phase 0 fixes.
Some steps use graceful degradation for Kurtosis environments where full
block production may not be available.
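As an illustration, one of these steps might assert the no-replay invariant roughly like this (the step text and World fields are assumptions, not the actual integration_steps.rs code):

```rust
use std::collections::HashSet;

use cucumber::{then, World};

// Hypothetical World shape; the real one drains a live WebSocket instead.
#[derive(Debug, Default, World)]
struct IntegrationWorld {
    client_sent_ids: HashSet<u64>,    // RPC ids the test client actually used
    received: Vec<serde_json::Value>, // messages drained from the client socket
}

#[then("the client should not receive subscription replay responses")]
async fn should_not_receive_replay_responses(world: &mut IntegrationWorld) {
    for msg in &world.received {
        // A response carrying an id the client never sent would be a leaked
        // replay of the proxy's internal re-subscription request.
        if let Some(id) = msg.get("id").and_then(|i| i.as_u64()) {
            assert!(
                world.client_sent_ids.contains(&id),
                "replay response leaked to client: {msg}"
            );
        }
    }
}
```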

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
pamungkaski added a commit that referenced this pull request Jan 23, 2026
Removed issue numbering from inline comments in:
- src/proxy/ws.rs:171 - Changed 'Issue #5 fix' to descriptive comment
- tests/steps/integration_steps.rs:1345 - Changed 'Issue #2 and #5' to descriptive header

All Issue# references are now removed from code files (they remain only in documentation).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>