# micro-sre

An intelligent debugging assistant that automatically gathers, analyzes, and correlates SRE incident data from AlertManager and Kubernetes, using LLM-powered reasoning to help engineers quickly identify root causes.

## Features

- Automated Data Collection: Fetches alerts, pod logs, events, and configurations from Kubernetes
- LLM-Powered Analysis: Uses Claude/GPT to analyze incidents and identify root causes
- Timeline Generation: Creates chronological view of events leading to incidents
- Actionable Recommendations: Provides specific commands and steps to resolve issues
- REST API: Easy integration with existing monitoring tools
- CLI Tool: Quick debugging from the command line
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  Agent Orchestrator (LLM)                   │
└──────────────────┬──────────────────────────────────────────┘
                   │
       ┌───────────┼───────────┐
       │           │           │
┌──────▼─────┐  ┌──▼──────┐  ┌─▼────────┐
│AlertManager│  │   K8S   │  │ Analysis │
│ Collector  │  │Collector│  │  Engine  │
└────────────┘  └─────────┘  └──────────┘
```
## Prerequisites

- Go 1.22+
- Kubernetes cluster access (kubeconfig)
- Anthropic API key (or OpenAI)
- AlertManager (optional)
## Installation

1. Clone the repository:

   ```bash
   cd /Users/emirozbir/go/src/micro-sre
   ```

2. Install dependencies:

   ```bash
   make install-deps
   ```

3. Set up configuration:

   ```bash
   cp config/config.yaml config/config.local.yaml
   # Edit config/config.local.yaml with your settings
   ```

4. Export your API key:

   ```bash
   export ANTHROPIC_API_KEY="your-api-key-here"
   ```

5. Build the application:

   ```bash
   make build
   ```

## Running the Server

```bash
# Run directly
make run

# Or use the binary
./bin/micro-sre-server
```

The server will start on http://localhost:8080.
## CLI Usage

```bash
# Analyze a specific pod
./bin/micro-sre-cli -namespace production -pod api-server-xyz -lookback 2h

# Or with make
make run-cli NAMESPACE=production POD=api-server-xyz LOOKBACK=2h
```

## API Usage

Check that the server is healthy:

```bash
curl http://localhost:8080/health
```

Analyze a pod:

```bash
curl -X POST http://localhost:8080/api/v1/analyze/pod \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "default",
    "pod": "oom-killer-demo",
    "lookback": "1h"
  }'
```

Analyze an alert:

```bash
curl -X POST http://localhost:8080/api/v1/analyze/alert \
  -H "Content-Type: application/json" \
  -d '{
    "alert_id": "abc123",
    "namespace": "production",
    "pod": "api-server-xyz",
    "lookback": "1h"
  }'
```

Example response:

```json
{
  "alert": {
    "name": "PodCrashLooping",
    "severity": "critical",
    "namespace": "production",
    "pod": "api-server-xyz",
    "started_at": "2026-01-07T10:00:00Z"
  },
  "analysis": {
    "root_cause": "Database connection failure due to incorrect credentials",
    "confidence": "high",
    "reasoning": "Pod logs show repeated 'connection refused' errors...",
    "timeline": [
      {
        "timestamp": "2026-01-07T09:55:00Z",
        "event": "Deployment updated",
        "details": "New version deployed with updated DB config"
      },
      {
        "timestamp": "2026-01-07T10:00:00Z",
        "event": "Pod started crashing",
        "details": "Exit code 1, connection error"
      }
    ],
    "evidence": {
      "logs": [
        {
          "timestamp": "2026-01-07T10:00:15Z",
          "line": "FATAL: password authentication failed for user 'app'"
        }
      ],
      "events": [
        {
          "type": "Warning",
          "reason": "BackOff",
          "message": "Back-off restarting failed container"
        }
      ]
    },
    "recommendations": [
      {
        "priority": "high",
        "action": "Verify database credentials",
        "command": "kubectl get secret db-creds -n production -o yaml"
      },
      {
        "priority": "high",
        "action": "Test database connectivity",
        "command": "kubectl exec -it api-server-xyz -- nc -zv postgres-svc 5432"
      }
    ]
  },
  "collected_data": {
    "logs_lines": 1000,
    "events_count": 12,
    "time_range": "1h"
  }
}
```

## Configuration

Edit config/config.yaml:
```yaml
alertmanager:
  url: "http://alertmanager:9093"
  poll_interval: "30s"

kubernetes:
  kubeconfig: ""  # empty for in-cluster config
  context: ""     # optional

log_collection:
  default_lookback: "1h"
  max_lookback: "24h"
  tail_lines: 1000
  include_previous: true

llm:
  provider: "anthropic"  # or "openai"
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5"
  max_tokens: 4096
  temperature: 0.2

server:
  port: 8080
  host: "0.0.0.0"
```

## Docker

```bash
# Build image
make docker-build

# Run container
make docker-run
```

## Kubernetes Deployment

```bash
# Apply manifests
kubectl apply -f deploy/k8s/micro-sre/
```
## Project Structure

```
├── cmd/
│   ├── server/      # HTTP server
│   └── cli/         # CLI tool
├── internal/
│   ├── agent/       # Agent orchestrator
│   ├── collectors/  # Data collectors (K8S, AlertManager)
│   ├── llm/         # LLM client (Anthropic, OpenAI)
│   ├── models/      # Data models
│   ├── api/         # HTTP handlers
│   └── config/      # Configuration
├── config/          # Config files
├── examples/        # Example payloads
└── DESIGN.md        # Detailed design document
```

## Development

```bash
# Run tests
make test

# Format code
make fmt
```

## How It Works

- Alert Detection: Receives alert from AlertManager (webhook or polling)
- Context Gathering: Agent determines what data to collect based on alert metadata
- Parallel Collection: Fetches pod logs, events, configurations from K8S API
- LLM Analysis: Sends collected data to Claude/GPT for root cause analysis
- Result Structuring: Parses LLM response into structured format
- Delivery: Returns analysis via API or CLI
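The delivery step returns the structured analysis over the REST API shown earlier. A minimal Go client for the analyze-pod endpoint might be built like this; the endpoint path and JSON fields come from the examples above, while the type and function names are illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// AnalyzePodRequest mirrors the /api/v1/analyze/pod payload.
type AnalyzePodRequest struct {
	Namespace string `json:"namespace"`
	Pod       string `json:"pod"`
	Lookback  string `json:"lookback"`
}

// newAnalyzeRequest builds the POST request; send it with http.DefaultClient.Do.
func newAnalyzeRequest(baseURL string, r AnalyzePodRequest) (*http.Request, error) {
	body, err := json.Marshal(r)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/analyze/pod", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newAnalyzeRequest("http://localhost:8080", AnalyzePodRequest{
		Namespace: "default",
		Pod:       "oom-killer-demo",
		Lookback:  "1h",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String()) // POST http://localhost:8080/api/v1/analyze/pod
}
```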
## Agent Capabilities

The "agent" uses LLM reasoning to:
- Dynamically decide what data to fetch based on alert type
- Iteratively narrow down root causes through multi-step reasoning
- Recognize common failure patterns (OOMKilled, CrashLoopBackOff, etc.)
- Generate debugging runbooks on-the-fly
- Provide actionable recommendations with specific commands
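For the pattern-recognition point, a first routing step can be as simple as mapping well-known Kubernetes failure reasons to broad categories that steer what evidence to gather next. The categories below are an illustrative sketch, not the project's actual taxonomy:

```go
package main

import "fmt"

// classifyFailure maps common Kubernetes failure reasons to a broad
// category the agent can use to pick an evidence-gathering strategy.
func classifyFailure(reason string) string {
	switch reason {
	case "OOMKilled":
		return "memory"
	case "CrashLoopBackOff", "Error":
		return "crash"
	case "ImagePullBackOff", "ErrImagePull":
		return "image"
	case "FailedScheduling":
		return "scheduling"
	default:
		return "unknown"
	}
}

func main() {
	for _, r := range []string{"OOMKilled", "CrashLoopBackOff", "ImagePullBackOff"} {
		fmt.Printf("%s -> %s\n", r, classifyFailure(r))
	}
}
```

A lookup like this handles the common cases cheaply; the LLM reasoning described above is what covers the long tail of failures that do not match a known reason.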
## Roadmap

- Implement proper JSON parsing from LLM responses
- Add support for OpenAI provider
- Multi-cluster support
- Historical incident storage and pattern learning
- Slack/PagerDuty integration
- Auto-remediation capabilities
- Prometheus metrics correlation
- Distributed tracing integration
## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## License

MIT License

For detailed design documentation, see DESIGN.md.