Add metrics/monitoring for caught panics in connection tasks #217

@coderabbitai

Context

Following PR #215, which implemented panic handling for connection tasks, we should add production monitoring to track how often connection panics occur.

Problem

Currently, panics in connection tasks are caught and logged, but there's no easy way for operators to:

  • Monitor panic frequency in production
  • Set up alerts for elevated panic rates
  • Analyze trends in connection task failures
  • Correlate panics with specific deployment changes or traffic patterns

Proposed Solutions

Option 1: Metrics Counter

Add a metrics counter using a standard metrics library:

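use metrics::counter;

// In the panic handling code
tracing::error!("Connection task panicked: {:?}, peer_addr: {:?}", panic_info, peer_addr);
// Macro form used by `metrics` 0.21 and earlier; from 0.22 the value moves to `.increment(1)`.
// A raw `peer_addr` label is high-cardinality and may be sensitive (see Privacy below).
counter!("wireframe.connection.panics", 1, "peer_addr" => peer_addr.to_string());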

Option 2: Custom Hook/Callback

Expose a configurable hook for panic events:

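use std::any::Any;
use std::net::SocketAddr;

pub trait PanicHandler: Send + Sync {
    // Panic payloads are `Box<dyn Any + Send>`, so the borrowed form carries `Send` too.
    fn on_connection_panic(&self, peer_addr: Option<SocketAddr>, panic_info: &(dyn Any + Send));
}

// In server configuration (assumes a `panic_handler: Option<Box<dyn PanicHandler>>` field)
impl ServerBuilder {
    pub fn with_panic_handler<H: PanicHandler + 'static>(mut self, handler: H) -> Self {
        self.panic_handler = Some(Box::new(handler));
        self
    }
}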
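
As a rough illustration of how such a hook might be used, the sketch below implements the trait with a handler that only counts panics. It uses the standard library alone. `CountingHandler` and `run_connection` are hypothetical names, and `std::panic::catch_unwind` stands in for however the server actually observes the panic (for example, a task join error).

```rust
use std::any::Any;
use std::net::SocketAddr;
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hook from Option 2, with the payload typed as `dyn Any + Send` to match
/// what `catch_unwind` returns.
pub trait PanicHandler: Send + Sync {
    fn on_connection_panic(&self, peer_addr: Option<SocketAddr>, panic_info: &(dyn Any + Send));
}

/// Hypothetical handler that only counts panics.
#[derive(Default)]
pub struct CountingHandler {
    panics: AtomicU64,
}

impl CountingHandler {
    /// Number of panics reported so far.
    pub fn count(&self) -> u64 {
        self.panics.load(Ordering::Relaxed)
    }
}

impl PanicHandler for CountingHandler {
    fn on_connection_panic(&self, _peer_addr: Option<SocketAddr>, _panic_info: &(dyn Any + Send)) {
        // A relaxed atomic increment is enough for a monotonically increasing counter.
        self.panics.fetch_add(1, Ordering::Relaxed);
    }
}

/// Hypothetical stand-in for the connection task wrapper: runs `task` and
/// reports a panic to the handler instead of propagating it.
/// Returns `true` if the task completed without panicking.
pub fn run_connection<F: FnOnce()>(
    handler: &dyn PanicHandler,
    peer_addr: Option<SocketAddr>,
    task: F,
) -> bool {
    match panic::catch_unwind(AssertUnwindSafe(task)) {
        Ok(()) => true,
        Err(payload) => {
            handler.on_connection_panic(peer_addr, &*payload);
            false
        }
    }
}
```

The handler is only invoked on the error branch, so a connection that does not panic pays nothing beyond the `catch_unwind` wrapper the server already needs.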

Option 3: Structured Logging Enhancement

Enhance the existing tracing with structured fields for easier monitoring:

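// `type_name_of_val` on a `dyn Any` reports only the trait-object type, so downcast the payload instead.
let panic_message = panic_info.downcast_ref::<&str>().copied()
    .or_else(|| panic_info.downcast_ref::<String>().map(String::as_str))
    .unwrap_or("<non-string payload>");
tracing::error!(
    peer_addr = ?peer_addr,
    panic_message,
    "Connection task panicked"
);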
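
To make the structured field useful, the panic payload usually has to be downcast to a string first. The helper below is a minimal standard-library sketch of that step. `panic_message` is a hypothetical name, and the fallback text is an arbitrary choice.

```rust
use std::any::Any;

/// Best-effort extraction of a human-readable message from a panic payload.
///
/// `panic!("literal")` produces a `&'static str` payload and
/// `panic!("formatted {}", x)` produces a `String`; anything else (for example
/// a value passed to `std::panic::panic_any`) falls back to a placeholder.
pub fn panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        *s
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.as_str()
    } else {
        "<non-string panic payload>"
    }
}
```

The returned `&str` borrows from the payload, so it can be passed straight to a logging macro without allocating.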

Implementation Considerations

  1. Minimal overhead: Metrics collection should add no measurable cost to connections that do not panic
  2. Configurable: Allow disabling metrics collection if not needed
  3. Standard integration: Use common metrics libraries (prometheus, statsd, etc.)
  4. Privacy: Avoid exposing sensitive connection data, such as raw peer addresses, in metric labels
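
One way to reconcile the privacy point with a per-peer dimension is to reduce the address to a coarse prefix before using it as a label. The sketch below is an assumption rather than an agreed design: `peer_addr_label` is a hypothetical helper, and the /24 and /48 prefix lengths are arbitrary choices that would need review against the project's privacy and cardinality requirements.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Hypothetical helper: reduce a peer address to a coarse label.
///
/// The port is dropped, IPv4 addresses are masked to a /24 prefix, and IPv6
/// addresses to a /48 prefix. A missing address becomes "unknown".
pub fn peer_addr_label(peer_addr: Option<SocketAddr>) -> String {
    match peer_addr.map(|addr| addr.ip()) {
        None => "unknown".to_string(),
        Some(IpAddr::V4(ip)) => {
            let [a, b, c, _] = ip.octets();
            format!("{}/24", Ipv4Addr::new(a, b, c, 0))
        }
        Some(IpAddr::V6(ip)) => {
            let s = ip.segments();
            format!("{}/48", Ipv6Addr::new(s[0], s[1], s[2], 0, 0, 0, 0, 0))
        }
    }
}
```

Even a masked prefix can produce many distinct label values under broad traffic, so a deployment might prefer to drop the dimension entirely and rely on logs for per-peer detail.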

Benefits

  • Operational visibility: Monitor connection stability in production
  • Proactive debugging: Identify problematic connection patterns
  • Performance insights: Correlate panics with system load/configuration
  • SLA monitoring: Track reliability metrics for the server

Acceptance Criteria

  • Metrics are collected when connection tasks panic
  • Metrics include relevant dimensions (peer_addr pattern, timestamp)
  • Metrics collection can be disabled via configuration
  • Documentation explains how to set up monitoring dashboards
  • Minimal performance impact on happy path
