Micro-SRE: Agentic SRE Debugging Assistant

An intelligent debugging assistant that automatically gathers, analyzes, and correlates SRE incident data from AlertManager and Kubernetes to help engineers quickly identify root causes using LLM-powered reasoning.

Features

  • Automated Data Collection: Fetches alerts, pod logs, events, and configurations from Kubernetes
  • LLM-Powered Analysis: Uses Claude/GPT to analyze incidents and identify root causes
  • Timeline Generation: Creates chronological view of events leading to incidents
  • Actionable Recommendations: Provides specific commands and steps to resolve issues
  • REST API: Easy integration with existing monitoring tools
  • CLI Tool: Quick debugging from the command line

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Agent Orchestrator (LLM)                  │
└──────────────────┬──────────────────────────────────────────┘
                   │
       ┌───────────┼───────────┐
       │           │           │
┌──────▼─────┐ ┌──▼──────┐ ┌─▼────────┐
│AlertManager│ │   K8S   │ │ Analysis │
│ Collector  │ │Collector│ │  Engine  │
└────────────┘ └─────────┘ └──────────┘
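In code terms, the orchestrator fans requests out to the collectors and feeds the merged context to the analysis engine. A minimal Go sketch of that shape (the Collector interface and type names here are illustrative, not the actual internal/agent API):

package agent

import "context"

// Collector is an illustrative interface: each collector fetches one
// slice of incident context (alerts, logs, events, configs).
type Collector interface {
	Collect(ctx context.Context, namespace, pod string) (map[string]string, error)
}

// Orchestrator gathers context from every collector and hands the
// merged result to the LLM-backed analysis step.
type Orchestrator struct {
	collectors []Collector
	analyze    func(ctx context.Context, data map[string]string) (string, error)
}

func (o *Orchestrator) Run(ctx context.Context, namespace, pod string) (string, error) {
	merged := make(map[string]string)
	for _, c := range o.collectors {
		part, err := c.Collect(ctx, namespace, pod)
		if err != nil {
			return "", err
		}
		for k, v := range part {
			merged[k] = v
		}
	}
	return o.analyze(ctx, merged)
}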

Quick Start

Prerequisites

  • Go 1.22+
  • Kubernetes cluster access (kubeconfig)
  • Anthropic API key (or OpenAI)
  • AlertManager (optional)

Installation

  1. Clone the repository:
git clone https://github.com/hepapi/hepsre-mini.git
cd hepsre-mini
  2. Install dependencies:
make install-deps
  3. Set up configuration:
cp config/config.yaml config/config.local.yaml
# Edit config/config.local.yaml with your settings
  4. Export your API key:
export ANTHROPIC_API_KEY="your-api-key-here"
  5. Build the application:
make build

Running the Server

# Run directly
make run

# Or use the binary
./bin/micro-sre-server

The server will start on http://localhost:8080

Using the CLI

# Analyze a specific pod
./bin/micro-sre-cli -namespace production -pod api-server-xyz -lookback 2h

# Or with make
make run-cli NAMESPACE=production POD=api-server-xyz LOOKBACK=2h

API Usage

Health Check

curl http://localhost:8080/health

Analyze a Pod

curl -X POST http://localhost:8080/api/v1/analyze/pod \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "default",
    "pod": "oom-killer-demo",
    "lookback": "1h"
  }'

Analyze an Alert

curl -X POST http://localhost:8080/api/v1/analyze/alert \
  -H "Content-Type: application/json" \
  -d '{
    "alert_id": "abc123",
    "namespace": "production",
    "pod": "api-server-xyz",
    "lookback": "1h"
  }'
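Either endpoint can also be called from Go with just the standard library. A small example against the pod-analysis route (the request fields mirror the JSON bodies above):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Build the same request body as the curl example above.
	body, _ := json.Marshal(map[string]string{
		"namespace": "default",
		"pod":       "oom-killer-demo",
		"lookback":  "1h",
	})
	resp, err := http.Post("http://localhost:8080/api/v1/analyze/pod",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // structured analysis JSON
}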

Example Response

{
  "alert": {
    "name": "PodCrashLooping",
    "severity": "critical",
    "namespace": "production",
    "pod": "api-server-xyz",
    "started_at": "2026-01-07T10:00:00Z"
  },
  "analysis": {
    "root_cause": "Database connection failure due to incorrect credentials",
    "confidence": "high",
    "reasoning": "Pod logs show repeated 'connection refused' errors...",
    "timeline": [
      {
        "timestamp": "2026-01-07T09:55:00Z",
        "event": "Deployment updated",
        "details": "New version deployed with updated DB config"
      },
      {
        "timestamp": "2026-01-07T10:00:00Z",
        "event": "Pod started crashing",
        "details": "Exit code 1, connection error"
      }
    ],
    "evidence": {
      "logs": [
        {
          "timestamp": "2026-01-07T10:00:15Z",
          "line": "FATAL: password authentication failed for user 'app'"
        }
      ],
      "events": [
        {
          "type": "Warning",
          "reason": "BackOff",
          "message": "Back-off restarting failed container"
        }
      ]
    },
    "recommendations": [
      {
        "priority": "high",
        "action": "Verify database credentials",
        "command": "kubectl get secret db-creds -n production -o yaml"
      },
      {
        "priority": "high",
        "action": "Test database connectivity",
        "command": "kubectl exec -it api-server-xyz -- nc -zv postgres-svc 5432"
      }
    ]
  },
  "collected_data": {
    "logs_lines": 1000,
    "events_count": 12,
    "time_range": "1h"
  }
}
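Programmatic consumers can decode the response into plain structs. A partial sketch covering just the fields shown above (field names are taken from this example, not from internal/models):

package main

import (
	"encoding/json"
	"fmt"
)

// AnalysisResult mirrors part of the JSON response above; only the
// fields needed here are declared, unknown fields are ignored.
type AnalysisResult struct {
	Analysis struct {
		RootCause       string `json:"root_cause"`
		Confidence      string `json:"confidence"`
		Recommendations []struct {
			Priority string `json:"priority"`
			Action   string `json:"action"`
			Command  string `json:"command"`
		} `json:"recommendations"`
	} `json:"analysis"`
}

func main() {
	raw := []byte(`{"analysis":{"root_cause":"...","confidence":"high"}}`)
	var r AnalysisResult
	if err := json.Unmarshal(raw, &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Analysis.Confidence) // "high"
}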

Configuration

Edit config/config.yaml:

alertmanager:
  url: "http://alertmanager:9093"
  poll_interval: "30s"

kubernetes:
  kubeconfig: ""  # empty for in-cluster config
  context: ""     # optional

log_collection:
  default_lookback: "1h"
  max_lookback: "24h"
  tail_lines: 1000
  include_previous: true

llm:
  provider: "anthropic"  # or "openai"
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5"
  max_tokens: 4096
  temperature: 0.2

server:
  port: 8080
  host: "0.0.0.0"

Deployment

Docker

# Build image
make docker-build

# Run container
make docker-run

Kubernetes

# Apply manifests
kubectl apply -f deploy/k8s/

Development

Project Structure

micro-sre/
├── cmd/
│   ├── server/          # HTTP server
│   └── cli/             # CLI tool
├── internal/
│   ├── agent/           # Agent orchestrator
│   ├── collectors/      # Data collectors (K8S, AlertManager)
│   ├── llm/             # LLM client (Anthropic, OpenAI)
│   ├── models/          # Data models
│   ├── api/             # HTTP handlers
│   └── config/          # Configuration
├── config/              # Config files
├── examples/            # Example payloads
└── DESIGN.md            # Detailed design document

Running Tests

make test

Code Formatting

make fmt

How It Works

  1. Alert Detection: Receives alert from AlertManager (webhook or polling)
  2. Context Gathering: Agent determines what data to collect based on alert metadata
  3. Parallel Collection: Fetches pod logs, events, and configurations from the K8S API concurrently (see the sketch after this list)
  4. LLM Analysis: Sends collected data to Claude/GPT for root cause analysis
  5. Result Structuring: Parses LLM response into structured format
  6. Delivery: Returns analysis via API or CLI
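Step 3 lends itself to a fan-out. A minimal sketch using golang.org/x/sync/errgroup (the fetch functions are placeholders, not the repository's actual collector calls):

package agent

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Placeholder fetchers standing in for the real collector calls.
func fetchLogs(ctx context.Context, ns, pod string) (string, error)   { return "logs", nil }
func fetchEvents(ctx context.Context, ns, pod string) (string, error) { return "events", nil }
func fetchConfig(ctx context.Context, ns, pod string) (string, error) { return "config", nil }

// collectAll runs the three fetches concurrently and fails fast if
// any of them returns an error.
func collectAll(ctx context.Context, ns, pod string) (logs, events, cfg string, err error) {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { var e error; logs, e = fetchLogs(ctx, ns, pod); return e })
	g.Go(func() error { var e error; events, e = fetchEvents(ctx, ns, pod); return e })
	g.Go(func() error { var e error; cfg, e = fetchConfig(ctx, ns, pod); return e })
	err = g.Wait()
	return
}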

Agentic Approach

The "agent" uses LLM reasoning to:

  • Dynamically decide what data to fetch based on alert type (a toy dispatch is sketched after this list)
  • Iteratively narrow down root causes through multi-step reasoning
  • Recognize common failure patterns (OOMKilled, CrashLoopBackOff, etc.)
  • Generate debugging runbooks on-the-fly
  • Provide actionable recommendations with specific commands
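To make the first point concrete, here is a deliberately simplified, table-driven stand-in for that decision. In the real agent the LLM reasons about what to fetch; this toy switch only illustrates the input/output shape:

package agent

// dataPlan maps an alert name to the evidence worth gathering before
// asking the LLM for analysis. A static table like this is only an
// illustration; the agent's actual dispatch is LLM-driven.
func dataPlan(alertName string) []string {
	switch alertName {
	case "OOMKilled", "PodMemoryHigh":
		return []string{"pod_logs", "resource_limits", "node_memory_pressure"}
	case "CrashLoopBackOff", "PodCrashLooping":
		return []string{"pod_logs", "previous_logs", "events", "exit_codes"}
	default:
		return []string{"pod_logs", "events"}
	}
}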

Roadmap

  • Implement proper JSON parsing from LLM responses
  • Add support for OpenAI provider
  • Multi-cluster support
  • Historical incident storage and pattern learning
  • Slack/PagerDuty integration
  • Auto-remediation capabilities
  • Prometheus metrics correlation
  • Distributed tracing integration

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License

Support

For detailed design documentation, see DESIGN.md
