Problem
Currently, the tunnel lacks robust health monitoring and automatic recovery mechanisms. When the tunnel fails (network issues, resolver unavailability, silent packet loss), detection is slow and recovery requires a manual restart.
This becomes especially critical in highly restricted network environments where the tunnel may be established through an intermediary server running in non-interactive or headless mode. In such scenarios, recovering from a failure requires restarting the tunnel from the server side, but access to that server may be limited, intermittent, or available only during narrow time windows. Without self-healing capabilities, a silent tunnel failure can render the connection unusable until the next opportunity for manual intervention.
Current State
- QUIC keep-alive is enabled (400 ms) but never verified
- Connection close/reset is detected via callbacks, but this triggers a program exit
- No automatic reconnection logic
- No per-resolver health tracking
- Recursive mode has no timeout for unresponsive resolvers
- Path quality metrics are collected but not used for failure detection
Proposed Solution
Client-Side (Primary)
- Active Health Probing
  - Periodic lightweight probes independent of data transfer
  - Configurable interval (e.g., --health-check-interval)
  - Detect silent failures within seconds
- Automatic Reconnection
  - Exponential backoff on connection failure
  - Configurable retry budget and max delay
  - Preserve TCP listeners during reconnection attempts
- Per-Resolver Health Tracking
  - Track success/failure rate per resolver
  - Automatic failover to healthy resolvers
  - Circuit breaker pattern for repeatedly failing resolvers
- Path Quality Thresholds
  - Use existing RTT/loss metrics for degradation detection
  - Switch paths when quality drops below threshold
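To make the reconnection and per-resolver items above more concrete, here is a rough Python sketch of how exponential backoff with a retry budget and a simple circuit breaker could fit together. It is only an illustration of the idea, not the tunnel's actual code: the names `backoff_delays`, `ResolverBreaker`, and `pick_resolver`, as well as all default values (base delay, threshold, cooldown), are assumptions made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


def backoff_delays(base: float = 0.5, factor: float = 2.0, max_delay: float = 30.0,
                   max_retries: int = 6, jitter: float = 0.0,
                   rng=random.random) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnection attempt.

    The delay grows geometrically and is capped at max_delay; max_retries is
    the retry budget. jitter (0..1) optionally randomizes each delay by
    +/- that fraction so many clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        delay = min(max_delay, base * factor ** attempt)
        yield delay * (1.0 + jitter * (2.0 * rng() - 1.0))


@dataclass
class ResolverBreaker:
    """Hypothetical per-resolver circuit breaker (all defaults are assumptions)."""
    failure_threshold: int = 3      # consecutive failures before opening
    cooldown: float = 10.0          # seconds to wait before a trial request
    failures: int = 0
    opened_at: Optional[float] = None

    def record(self, ok: bool, now: float) -> None:
        if ok:
            # Any success closes the breaker and resets the failure count.
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open (or re-open) the breaker, restarting the cooldown.
                self.opened_at = now

    def usable(self, now: float) -> bool:
        # Closed, or open long enough that a trial (half-open) is allowed.
        return self.opened_at is None or now - self.opened_at >= self.cooldown


def pick_resolver(breakers: Dict[str, ResolverBreaker], now: float) -> Optional[str]:
    """Return the first resolver whose breaker allows traffic, else None."""
    for name, breaker in breakers.items():
        if breaker.usable(now):
            return name
    return None
```

In this sketch the caller would sleep for each value from `backoff_delays()` between reconnection attempts while keeping the TCP listeners open, and would call `record()` after every query so that `pick_resolver()` skips resolvers that keep failing until their cooldown has elapsed.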
Server-Side (Optional)
- Session idle timeout tracking (already partially exists in UDP fallback)
- Metrics/logging for client health events
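As a rough illustration of the server-side idle timeout idea, a tracker could record the last time each session was seen and periodically expire the stale ones. This is a minimal Python sketch under our own assumptions; `IdleTracker`, its methods, and the timeout value are hypothetical and not taken from the existing UDP fallback code.

```python
from typing import Dict, List


class IdleTracker:
    """Hypothetical tracker for per-session idle timeouts."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout        # seconds of allowed silence
        self.last_seen: Dict[str, float] = {}   # session id -> last activity time

    def touch(self, session_id: str, now: float) -> None:
        # Call whenever any packet (data or health probe) arrives for a session.
        self.last_seen[session_id] = now

    def expire(self, now: float) -> List[str]:
        # Remove and return sessions that have been silent longer than the timeout.
        expired = [s for s, t in self.last_seen.items() if now - t > self.idle_timeout]
        for session_id in expired:
            del self.last_seen[session_id]
        return expired
```

The list returned by `expire()` is also a natural place to emit the proposed metrics or log lines for client health events.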
We'd love to hear your feedback on these proposed solutions. We're happy to contribute to the implementation; we just wanted to align on the overall strategy and approach before diving into the code.