Context
Following PR #215, which implemented panic handling for connection tasks, we should add production monitoring to track how often connection panics occur.
Problem
Currently, panics in connection tasks are caught and logged, but there's no easy way for operators to:
- Monitor panic frequency in production
- Set up alerts for elevated panic rates
- Analyze trends in connection task failures
- Correlate panics with specific deployment changes or traffic patterns
Proposed Solutions
Option 1: Metrics Counter
Add a metrics counter using a standard metrics library:
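use metrics::counter;
// In the panic handling code
tracing::error!("Connection task panicked: {:?}, peer_addr: {:?}", panic_info, peer_addr);
// Assumes metrics 0.22 or later; earlier releases take the increment as a macro argument.
// The peer address is not used as a label: it is high-cardinality and may be sensitive.
counter!("wireframe.connection.panics").increment(1);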
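To make the counting step concrete without committing to a particular metrics backend, here is a minimal standard-library sketch. The names PANIC_COUNT, run_guarded, and panic_count are hypothetical, and the atomic counter only stands in for whatever the chosen metrics library provides. It also uses a synchronous closure for brevity; real connection tasks would need an equivalent wrapper around their futures.

```rust
use std::panic::{self, UnwindSafe};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical process-wide counter standing in for a real metrics backend.
static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

/// Runs `task`, increments the counter if it panics, and returns the outcome.
/// `None` means the task panicked.
pub fn run_guarded<T>(task: impl FnOnce() -> T + UnwindSafe) -> Option<T> {
    match panic::catch_unwind(task) {
        Ok(value) => Some(value),
        Err(_payload) => {
            // Relaxed ordering suffices: the counter is only read for reporting.
            PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
            None
        }
    }
}

/// Returns the number of panics recorded so far.
pub fn panic_count() -> u64 {
    PANIC_COUNT.load(Ordering::Relaxed)
}
```

A single atomic increment on the panic path adds no cost to connections that complete normally, which fits the minimal-overhead goal.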
Option 2: Custom Hook/Callback
Expose a configurable hook for panic events:
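use std::any::Any;
use std::net::SocketAddr;
pub trait PanicHandler: Send + Sync {
    // Called once per caught panic; panic_info is the type-erased panic payload.
    fn on_connection_panic(&self, peer_addr: Option<SocketAddr>, panic_info: &dyn Any);
}
// In server configuration
impl ServerBuilder {
    // Registers a handler that is invoked whenever a connection task panics.
    pub fn with_panic_handler<H: PanicHandler + 'static>(mut self, handler: H) -> Self {
        self.panic_handler = Some(Box::new(handler));
        self
    }
}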
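As an illustration of how an operator might use such a hook, the sketch below implements the proposed trait with a handler that simply counts panics. CountingHandler is a hypothetical type, and the trait is repeated only so the example compiles on its own; a real handler could forward to any metrics or alerting system instead.

```rust
use std::any::Any;
use std::net::SocketAddr;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Trait as proposed in this issue, repeated so the sketch is self-contained.
pub trait PanicHandler: Send + Sync {
    fn on_connection_panic(&self, peer_addr: Option<SocketAddr>, panic_info: &dyn Any);
}

/// Hypothetical handler that counts panics in a shared atomic.
#[derive(Clone, Default)]
pub struct CountingHandler {
    count: Arc<AtomicU64>,
}

impl CountingHandler {
    /// Number of panics observed so far, across all clones of this handler.
    pub fn count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}

impl PanicHandler for CountingHandler {
    fn on_connection_panic(&self, _peer_addr: Option<SocketAddr>, _panic_info: &dyn Any) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}
```

Because the counter sits behind an Arc, the application can keep a clone for reporting while the server owns the boxed handler.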
Option 3: Structured Logging Enhancement
Enhance the existing tracing with structured fields for easier monitoring:
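// On a type-erased payload, type_name_of_val only reports `dyn Any + Send`; downcast instead.
let panic_message = panic_info.downcast_ref::<&str>().copied()
    .or_else(|| panic_info.downcast_ref::<String>().map(String::as_str))
    .unwrap_or("<non-string panic payload>");
tracing::error!(
    peer_addr = ?peer_addr,
    panic_message,
    "Connection task panicked"
);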
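Panic payloads are type-erased, so a structured field is most useful when it carries the recovered message. The helper below is a minimal sketch of that extraction; panic_message is a hypothetical name, and it assumes the payload has the Box<dyn Any + Send> shape returned by std::panic::catch_unwind.

```rust
use std::any::Any;

/// Recovers a human-readable message from a panic payload.
/// `panic!("literal")` yields a `&'static str`; `panic!("fmt {}", x)` yields a `String`.
pub fn panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(msg) = payload.downcast_ref::<&'static str>() {
        *msg
    } else if let Some(msg) = payload.downcast_ref::<String>() {
        msg.as_str()
    } else {
        // Payloads created with `panic_any` can be of any type.
        "<non-string panic payload>"
    }
}
```

The returned string can be attached to the log event as a field, so log-based alerting can group panics by message.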
Implementation Considerations
- Minimal overhead: Metrics collection should not impact performance
- Configurable: Allow disabling metrics collection if not needed
- Standard integration: Use common metrics libraries (Prometheus, StatsD, etc.)
- Privacy: Avoid logging sensitive connection data in metrics
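One way to reconcile the privacy point with useful labels is to reduce the peer address to a coarse category before it reaches any metric. The function below is a sketch under that assumption; peer_label and its label values are hypothetical and not part of any existing API.

```rust
use std::net::SocketAddr;

/// Maps a peer address to a coarse, low-cardinality label.
/// The raw address never appears in the metric, only its address family.
pub fn peer_label(peer_addr: Option<SocketAddr>) -> &'static str {
    match peer_addr {
        Some(SocketAddr::V4(_)) => "ipv4",
        Some(SocketAddr::V6(_)) => "ipv6",
        None => "unknown",
    }
}
```

Keeping the label set this small also bounds the number of time series a metrics backend has to store.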
Benefits
- Operational visibility: Monitor connection stability in production
- Proactive debugging: Identify problematic connection patterns
- Performance insights: Correlate panics with system load/configuration
- SLA monitoring: Track reliability metrics for the server
Acceptance Criteria
References