Feature: Tunnel Health Check and Fast Recovery #30

@crazydi4mond

Problem

Currently, the tunnel lacks robust health monitoring and automatic recovery mechanisms. When the tunnel fails (network issues, resolver unavailability, silent packet loss), detection is slow and recovery requires a manual restart.

This becomes especially critical in highly restricted network environments where the tunnel may be established through an intermediary server running in non-interactive or headless mode. In such scenarios, recovering from a failure requires restarting the tunnel from the server side, but access to that server may be limited, intermittent, or available only during narrow time windows. Without self-healing capabilities, a silent tunnel failure can render the connection unusable until the next opportunity for manual intervention.

Current State

  • QUIC keep-alive is enabled (400ms) but never verified
  • Connection close/reset is detected via callbacks, but it triggers a program exit
  • No automatic reconnection logic
  • No per-resolver health tracking
  • Recursive mode has no timeout for unresponsive resolvers
  • Path quality metrics are collected but not used for failure detection

Proposed Solution

Client-Side (Primary)

  1. Active Health Probing

    • Periodic lightweight probes independent of data transfer
    • Configurable interval (e.g., --health-check-interval)
    • Detect silent failures within seconds
  2. Automatic Reconnection

    • Exponential backoff on connection failure
    • Configurable retry budget and max delay
    • Preserve TCP listeners during reconnection attempts
  3. Per-Resolver Health Tracking

    • Track success/failure rate per resolver
    • Automatic failover to healthy resolvers
    • Circuit breaker pattern for repeatedly failing resolvers
  4. Path Quality Thresholds

    • Use existing RTT/loss metrics for degradation detection
    • Switch paths when quality drops below threshold
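
To make the reconnection item concrete, here is a minimal sketch of exponential backoff with a retry budget and a maximum delay. It is only an illustration of the idea, not the project's code: `backoff_delays`, `reconnect`, the `connect` callable, and all parameter names and defaults are hypothetical.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, max_delay=30.0, max_retries=8,
                   jitter=0.0, rng=random.random):
    """Yield the wait (in seconds) before each reconnection attempt.

    All names and defaults here are assumptions made for illustration.
    """
    for attempt in range(max_retries):
        # Grow the delay geometrically, but never beyond max_delay.
        delay = min(base * factor ** attempt, max_delay)
        # Optional jitter scales the delay by a factor in (1 - jitter, 1],
        # so that many clients do not retry in lockstep.
        yield delay * (1.0 - jitter * rng())


def reconnect(connect, sleep=time.sleep, **backoff_options):
    """Retry connect() until it succeeds or the retry budget runs out.

    `connect` is a hypothetical callable that returns a new tunnel session
    or raises OSError. Returns the session, or None if every attempt failed.
    """
    for delay in backoff_delays(**backoff_options):
        sleep(delay)
        try:
            return connect()
        except OSError:
            continue  # budget not exhausted yet, wait longer next time
    return None
```

With `base=0.5`, `factor=2.0`, and `max_delay=4.0`, the waits are 0.5, 1, 2, 4, 4 seconds. Because only the tunnel session is re-established inside `reconnect`, the local TCP listeners could stay open while the loop runs, which is what the third bullet of the reconnection item asks for.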

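For per-resolver health tracking, a circuit breaker could look roughly like the sketch below. The class names, the consecutive-failure threshold, and the cooldown are all assumptions chosen for illustration; the real design might track a failure rate over a window instead.

```python
import time


class ResolverHealth:
    """Circuit breaker state for one resolver (hypothetical sketch)."""

    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def record_success(self):
        # Any success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()

    def is_available(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has passed, allow a trial query.
        return self.clock() - self.opened_at >= self.cooldown


class ResolverPool:
    """Picks the first available resolver in configured order."""

    def __init__(self, addresses, **breaker_options):
        self.health = {addr: ResolverHealth(**breaker_options) for addr in addresses}

    def pick(self):
        for addr, health in self.health.items():
            if health.is_available():
                return addr
        return None  # every resolver is currently tripped

    def report(self, addr, ok):
        if ok:
            self.health[addr].record_success()
        else:
            self.health[addr].record_failure()
```

The caller would use `pick()` before each query and `report()` after it. A resolver that keeps failing is skipped until its cooldown expires, then gets a single trial; one more failure trips it again, while a success restores it.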
Server-Side (Optional)

  • Session idle timeout tracking (already partially exists in UDP fallback)
  • Metrics/logging for client health events
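
On the server side, idle timeout tracking can be as small as a table of last-seen timestamps. The sketch below is an assumption about how it might be structured, not a description of the existing UDP fallback code; `SessionTable`, `touch`, and `expire_idle` are hypothetical names.

```python
import time


class SessionTable:
    """Tracks when each client session was last active (hypothetical sketch)."""

    def __init__(self, idle_timeout=60.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_seen = {}  # session id -> timestamp of last activity

    def touch(self, session_id):
        # Call on every packet or probe received for the session.
        self.last_seen[session_id] = self.clock()

    def expire_idle(self):
        """Drop sessions idle for longer than idle_timeout and return their ids."""
        now = self.clock()
        expired = [sid for sid, seen in self.last_seen.items()
                   if now - seen > self.idle_timeout]
        for sid in expired:
            del self.last_seen[sid]
        return expired
```

The list returned by `expire_idle()` is a natural place to hook in the logging or metrics for client health events mentioned above.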

We'd love to hear your feedback on these proposed solutions. We're happy to contribute to the implementation; we just wanted to align on the overall strategy and approach before diving into the code.

Labels: bug (Something isn't working)