5 changes: 5 additions & 0 deletions auto-improve-skills/.gitignore
@@ -0,0 +1,5 @@
runs/*
!runs/.gitkeep
tmp/*
!tmp/.gitkeep
benchmarks/remote-host-diagnostics/generated-fixtures/
141 changes: 141 additions & 0 deletions auto-improve-skills/README.md
@@ -0,0 +1,141 @@
# Auto-Improve Skills

Autoresearch-style tooling for automatically improving Agent Skills with fixed benchmarks, nested `pi` runs, and git-tracked accepted iterations.

The current target is:

```text
auto-improve-skills/skills/remote-host-diagnostics/SKILL.md
```

The loop is inspired by <https://github.com/karpathy/autoresearch>: keep the benchmark fixed, let an LLM edit one target file, measure the candidate, then keep or reject it.

## Layout

```text
program.md Instructions for researcher agents
skills/remote-host-diagnostics/SKILL.md Target skill being improved
benchmarks/remote-host-diagnostics/cases.yaml Benchmark cases and deterministic scoring criteria
benchmarks/remote-host-diagnostics/generated-fixtures/ Generated fake logs (gitignored; recreated deterministically)
cmd/skillbench/ Go benchmark runner
cmd/skillfixtures/ Deterministic fixture generator
cmd/skilltrain/ Go improvement-loop orchestrator
internal/autoresearch/ Shared Go types/helpers
runs/ Benchmark/training outputs, gitignored except .gitkeep
report/remote-host-diagnostics-autoresearch.html Single-file slide report
```

## Prerequisites

- Run from the rshell repository root.
- Ensure the local `./rshell` binary exists. The benchmark runner can build it if missing, but the explicit setup is:

```sh
make build
```

- `pi` must be installed and authenticated for `openai-codex/gpt-5.5`.
- The Go tools now auto-detect `pi` from `PATH`, `PI_BIN`, the npm global prefix, and common nvm locations (see the sketch after this list).
- If auto-detection fails, pass `-pi /absolute/path/to/pi` or set `PI_BIN=/absolute/path/to/pi`.
- Example nvm path on this machine: `/Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi`.
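
The lookup order can be approximated like this; a minimal sketch of the documented behavior, where `findPi` and the exact npm/nvm paths are illustrative rather than the tools' actual code:

```go
package pidetect

import (
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// findPi approximates the documented lookup order: PI_BIN, PATH, the
// npm global prefix, then common nvm install locations.
func findPi() (string, error) {
	if p := os.Getenv("PI_BIN"); p != "" {
		return p, nil
	}
	if p, err := exec.LookPath("pi"); err == nil {
		return p, nil
	}
	// npm global prefix: <prefix>/bin/pi.
	if out, err := exec.Command("npm", "prefix", "-g").Output(); err == nil {
		p := filepath.Join(strings.TrimSpace(string(out)), "bin", "pi")
		if _, err := os.Stat(p); err == nil {
			return p, nil
		}
	}
	// Common nvm layout: ~/.nvm/versions/node/<version>/bin/pi.
	if home, err := os.UserHomeDir(); err == nil {
		matches, _ := filepath.Glob(filepath.Join(home, ".nvm/versions/node/*/bin/pi"))
		if len(matches) > 0 {
			return matches[len(matches)-1], nil // lexically last version; a sketch, not semver-aware
		}
	}
	return "", exec.ErrNotFound
}
```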

## Run the benchmark

```sh
go run ./auto-improve-skills/cmd/skillbench \
-model openai-codex/gpt-5.5
```

Useful variants:

```sh
# Quick smoke test
go run ./auto-improve-skills/cmd/skillbench -limit 1

# One specific case
go run ./auto-improve-skills/cmd/skillbench -case datadog-agent-config-regression

# More semantic, more expensive scoring with LLM-as-judge
go run ./auto-improve-skills/cmd/skillbench -judge
```

The runner deterministically regenerates large fake log fixtures under `auto-improve-skills/benchmarks/remote-host-diagnostics/generated-fixtures/` before each run. The generated logs are gitignored.

The runner writes a JSON report and raw nested-`pi` JSONL transcripts under `auto-improve-skills/runs/`.
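
This README does not pin down the report schema, so a consumer can decode it generically; a minimal sketch in which the run path and any top-level keys are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Illustrative path: actual run directories are created by skillbench.
	raw, err := os.ReadFile("auto-improve-skills/runs/example/report.json")
	if err != nil {
		panic(err)
	}
	var report map[string]any
	if err := json.Unmarshal(raw, &report); err != nil {
		panic(err)
	}
	for k, v := range report { // dump top-level keys; the schema is tool-defined
		fmt.Printf("%s: %v\n", k, v)
	}
}
```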

If you see `exec: "pi": executable file not found in $PATH`, either update to this version of the tooling or pass an explicit binary:

```sh
go run ./auto-improve-skills/cmd/skillbench \
-pi /Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi
```

## Run the training loop

Commit or stash unrelated changes first, then run:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-model openai-codex/gpt-5.5 \
-iters 3 \
-judge
```

The loop (a Go sketch follows the list):

1. Runs a baseline benchmark.
2. Invokes `pi` as a researcher to edit only `SKILL.md`.
3. Runs the benchmark again.
4. Commits the skill edit if the normalized score improves by at least `-min-delta`.
5. Reverts the skill edit if it does not improve.
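
The accept/reject rule reduces to one comparison against the running best score. A minimal Go sketch, with illustrative stubs standing in for the benchmark and researcher steps:

```go
package trainloop

import "os/exec"

const skillPath = "auto-improve-skills/skills/remote-host-diagnostics/SKILL.md"

// Illustrative stubs: runBenchmark would invoke skillbench and return a
// normalized score; runResearcher would invoke pi to edit SKILL.md.
func runBenchmark() (float64, error) { return 0, nil }
func runResearcher() error           { return nil }

// Train mirrors the five numbered steps above.
func Train(iters int, minDelta float64) error {
	best, err := runBenchmark() // 1. baseline benchmark
	if err != nil {
		return err
	}
	for i := 0; i < iters; i++ {
		if err := runResearcher(); err != nil { // 2. researcher edits only SKILL.md
			return err
		}
		score, err := runBenchmark() // 3. measure the candidate
		if err != nil {
			return err
		}
		if score >= best+minDelta { // 4. improved by at least -min-delta: commit
			if err := exec.Command("git", "commit", "-m", "skilltrain: accept iteration", skillPath).Run(); err != nil {
				return err
			}
			best = score
		} else if err := exec.Command("git", "checkout", "--", skillPath).Run(); err != nil { // 5. otherwise revert
			return err
		}
	}
	return nil
}
```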

If `pi` is outside your shell `PATH`, use the same `-pi` flag:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-pi /Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi \
-model openai-codex/gpt-5.5 \
-iters 3 \
-judge
```

For a safe proof run that exercises the loop without committing:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-iters 1 \
-limit 1 \
-dry-run \
-allow-dirty \
-run-dir auto-improve-skills/runs/train-proof
```

## Fixture generation

Generate or refresh the deterministic fixtures without running nested agents:

```sh
go run ./auto-improve-skills/cmd/skillfixtures
```

The generated files are intentionally not committed. Each log file contains 500 to 2,000 lines, with rotations, red herrings, cross-service correlations, and container/host-mounted log layouts.
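
Determinism comes from fixed seeding: the same seed reproduces byte-identical fixtures on every run, which is what makes it safe to keep them out of git. A minimal sketch of the idea, with illustrative file names and log lines rather than skillfixtures' actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
)

func main() {
	rng := rand.New(rand.NewSource(42)) // fixed seed: every run reproduces the same bytes
	if err := os.MkdirAll("generated-fixtures", 0o755); err != nil {
		panic(err)
	}
	f, err := os.Create("generated-fixtures/agent.log") // illustrative file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	levels := []string{"INFO", "WARN", "ERROR"}
	n := 500 + rng.Intn(1501) // 500 to 2,000 lines, matching the description above
	for i := 0; i < n; i++ {
		fmt.Fprintf(f, "2024-05-01 10:%02d:%02d UTC | CORE | %s | fake log line %d\n",
			rng.Intn(60), rng.Intn(60), levels[rng.Intn(3)], i)
	}
}
```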

## Current benchmark suite

The suite measures final-answer quality across realistic fake investigations:

- Datadog Agent config regression hidden among integration/APM/intake noise
- SSH brute-force summary with approximate counting and no-compromise distinction
- Checkout HTTP 500/502 root-cause correlation to PostgreSQL pool/slot exhaustion
- Containerized Agent host-log fallback with x509 failures caused by clock skew
- Unsupported `ss` flag recovery

More cases can be added to `benchmarks/remote-host-diagnostics/cases.yaml` without changing Go code.
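
For orientation, each criterion pairs a `source` (`final`, `commands`, or `transcript`) with a `contains` or `regex` check, an optional `not` inversion, and a `points` value. A minimal Go sketch of that deterministic scoring rule; the `Criterion` type here is an assumption, not the real `internal/autoresearch` definition:

```go
package main

import (
	"fmt"
	"regexp"
)

// Criterion mirrors the fields used in cases.yaml.
type Criterion struct {
	Name            string
	Source          string // "final", "commands", or "transcript"
	Contains        string // literal substring check
	Regex           string // regexp check (takes precedence here if set)
	CaseInsensitive bool
	Not             bool // invert: award points when the pattern is absent
	Points          int
}

// Score awards a criterion's points if its check passes against the
// text drawn from the named source.
func Score(c Criterion, sources map[string]string) int {
	pattern := c.Regex
	if pattern == "" {
		pattern = regexp.QuoteMeta(c.Contains)
	}
	if c.CaseInsensitive {
		pattern = "(?i)" + pattern
	}
	matched := regexp.MustCompile(pattern).MatchString(sources[c.Source])
	if c.Not {
		matched = !matched
	}
	if matched {
		return c.Points
	}
	return 0
}

func main() {
	c := Criterion{Name: "final cites agent.log", Source: "final",
		Contains: "agent.log", CaseInsensitive: true, Points: 10}
	final := map[string]string{"final": "Root cause: YAML parse error at agent.log line 42."}
	fmt.Println(Score(c, final)) // prints 10
}
```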

## Report

Open the slide report in a browser:

```text
auto-improve-skills/report/remote-host-diagnostics-autoresearch.html
```
247 changes: 247 additions & 0 deletions auto-improve-skills/benchmarks/remote-host-diagnostics/cases.yaml
@@ -0,0 +1,247 @@
name: remote-host-diagnostics-quality
description: >-
Final-answer quality benchmarks for the local ./rshell remote-host-diagnostics skill.
Cases use deterministic generated fixture logs with noisy rotations and red herrings, requiring
realistic bounded diagnostics rather than full-log dumping.
skill_path: ../../skills/remote-host-diagnostics
cases:
- id: datadog-agent-config-regression
title: Distinguish a Datadog Agent config regression from noisy integration failures
prompt: |-
Fake investigation: a customer says host metrics stopped shortly after 10:12 UTC, but traces and some log intake still look healthy.
Use the local rshell skill against this generated fixture log root: {{LOG_ROOT}}.
Determine the likely cause, separate it from red herrings in rotated/noisy logs, and provide evidence. Keep the investigation read-only and bounded.
judge_rubric: |-
Excellent answers identify the remote-config-triggered Datadog Agent configuration/YAML parse error at line 42 in agent.log as the likely cause, connect it to the core agent/aggregator stopping and no metrics being flushed, distinguish trace/APM/intake red herrings, list bounded rshell commands, and avoid claiming a remote host was accessed.
criteria:
- name: final identifies invalid config or YAML parse failure at line 42
source: final
case_insensitive: true
regex: "(yaml|config).*line=42|line 42.*(yaml|config)|invalid configuration|config validation failed"
points: 25
- name: final ties the regression to remote config reload rc-8831
source: final
case_insensitive: true
regex: "remote[- ]config|rc-8831|config reload"
points: 15
- name: final connects failure to stopped metrics or core agent
source: final
case_insensitive: true
regex: "stopped|no metrics|metrics.*stopped|core agent|aggregator"
points: 15
- name: final distinguishes trace/APM/intake noise from root cause
source: final
case_insensitive: true
regex: "trace|apm|intake|red herring|not.*cause|unrelated"
points: 10
- name: final cites evidence from agent.log
source: final
case_insensitive: true
contains: "agent.log"
points: 10
- name: commands use the provided generated fixture log root as allowed path
source: commands
contains: "--allowed-paths {{LOG_ROOT}}"
points: 10
- name: commands run initial help
source: commands
contains: "./rshell --allow-all-commands --timeout 5s -c 'help'"
points: 5
- name: commands use bounded filters over current or rotated agent logs
source: commands
case_insensitive: true
regex: '(grep|tail|head|wc|find).*datadog.*/agent\.log|datadog.*/agent\.log.*(grep|tail|head|wc)|grep.*(rc-8831|line=42|no metrics|core agent)'
points: 10

- id: auth-bruteforce-summary
title: Quantify SSH brute-force activity amid normal bastion log noise
prompt: |-
Fake investigation: security asks whether there is evidence of SSH brute-force activity in the generated bastion logs.
Use the local rshell skill against fixture log root {{LOG_ROOT}}.
Summarize the suspicious source, approximate scale, user pattern, and whether there was a successful login from that same source.
judge_rubric: |-
Excellent answers identify repeated failed SSH password attempts from 198.51.100.23, estimate roughly 96/about 100 failures across many invalid users, distinguish successful publickey logins from different IPs, cite auth.log evidence, and avoid overstating compromise because no successful login from 198.51.100.23 is present.
criteria:
- name: final identifies brute-force source IP
source: final
contains: "198.51.100.23"
points: 20
- name: final describes failed-password brute-force pattern
source: final
case_insensitive: true
regex: "failed password|failed login|brute|invalid user"
points: 15
- name: final includes approximate count near 96 failures
source: final
case_insensitive: true
regex: '\b96\b|\b9[0-9]\b|about 100|roughly 100|~100|hundred'
points: 15
- name: final says there was no successful login from the suspicious source
source: final
case_insensitive: true
regex: 'no successful|no accepted|not successful|no evidence.*success|no login.*198\.51\.100\.23'
points: 15
- name: final distinguishes accepted publickey login as a different source
source: final
regex: '203\.0\.113\.8|198\.51\.100\.77|different IP|different source'
points: 10
- name: final cites auth.log
source: final
case_insensitive: true
contains: "auth.log"
points: 10
- name: commands use grep/wc/sort/uniq or similarly bounded filters
source: commands
case_insensitive: true
regex: 'grep.*(Failed password|198\.51\.100\.23)|wc -l|sort|uniq'
points: 10
- name: final avoids claiming account compromise from fixture evidence
source: final
case_insensitive: true
not: true
regex: 'compromised|successful.*198\.51\.100\.23'
points: 5

- id: checkout-500-root-cause
title: Correlate checkout HTTP 500/502s to database pool exhaustion
prompt: |-
Fake investigation: checkout users are seeing bursts of HTTP 500/502 errors around 10:10 UTC.
Use the local rshell skill against fixture log root {{LOG_ROOT}}.
Find the likely backend cause across app, nginx, and system/postgres logs, separate it from unrelated errors, and suggest the next safe diagnostic check.
judge_rubric: |-
Excellent answers correlate nginx checkout 500/502 errors to checkout service PostgreSQL/database connection failures, identify connection pool/slot exhaustion and reporting-worker connection fanout as the likely driver, cite service.log plus nginx and system/postgres evidence, and recommend safe read-only next checks such as inspecting PostgreSQL activity/connection-pool metrics rather than remediation commands.
criteria:
- name: final mentions checkout HTTP 500 or 502 symptom
source: final
case_insensitive: true
regex: "500|502|checkout"
points: 10
- name: final identifies database/postgres connection slot or pool exhaustion
source: final
case_insensitive: true
regex: "database|postgres|connection refused|connection slots|too many clients|pool exhausted|db pool"
points: 20
- name: final identifies reporting-worker or connection fanout as likely driver
source: final
case_insensitive: true
regex: "reporting-worker|connection fanout|fanout|reports"
points: 15
- name: final cites service log evidence
source: final
case_insensitive: true
regex: 'service\.log|checkout'
points: 10
- name: final cites nginx access or error evidence
source: final
case_insensitive: true
regex: 'nginx|access\.log|error\.log'
points: 10
- name: final cites system/postgres evidence
source: final
case_insensitive: true
regex: 'system\.log|postgres|remaining connection slots|too many clients'
points: 10
- name: final suggests safe read-only next diagnostic check
source: final
case_insensitive: true
regex: "next|check|inspect|verify|pg_stat_activity|connection pool|metrics"
points: 10
- name: commands search across multiple logs with bounded filters
source: commands
case_insensitive: true
regex: "grep.*(500|502|database|postgres|checkout|reporting-worker)|tail|head|find"
points: 10
- name: final does not propose write/remediation commands
source: final
case_insensitive: true
not: true
regex: "restart|kill|delete|edit .*config|apply"
points: 5

- id: container-host-log-fallback
title: Use /host-style fallback and identify certificate failures caused by clock skew
prompt: |-
Fake investigation: this simulates a containerized Agent layout. The primary log root {{EMPTY_LOG_ROOT}} is empty;
host logs are mounted at {{HOST_LOG_ROOT}}. Use the local rshell skill to determine why the kubernetes_apiserver check is failing, and whether this looks like an expired certificate or a timing/clock issue.
judge_rubric: |-
Excellent answers first handle the empty primary log directory, then inspect the host-mounted log root, identify x509 "not yet valid" kubernetes_apiserver failures caused by host/container clock skew and chrony correction, cite both Datadog agent and syslog/chronyd evidence, and explain this as a containerized host-log fallback case.
criteria:
- name: final identifies x509 not-yet-valid certificate problem
source: final
case_insensitive: true
regex: "x509|not yet valid|certificate.*not"
points: 20
- name: final identifies clock skew or time synchronization as root cause
source: final
case_insensitive: true
regex: "clock|skew|chrony|chronyd|time sync|system clock|notbefore"
points: 20
- name: final names kubernetes_apiserver check
source: final
case_insensitive: true
contains: "kubernetes_apiserver"
points: 15
- name: final mentions host-mounted fallback or empty primary logs
source: final
case_insensitive: true
regex: "host|fallback|empty|mounted"
points: 10
- name: final cites datadog agent.log evidence
source: final
case_insensitive: true
regex: 'agent\.log|datadog'
points: 10
- name: final cites syslog or chronyd evidence
source: final
case_insensitive: true
regex: 'syslog|chronyd|chrony|clocksource'
points: 10
- name: commands inspect both empty and host log roots
source: commands
regex: '{{EMPTY_LOG_ROOT}}[\s\S]*{{HOST_LOG_ROOT}}|{{HOST_LOG_ROOT}}[\s\S]*{{EMPTY_LOG_ROOT}}'
points: 10
- name: avoids saying real remote host was contacted
source: final
case_insensitive: true
not: true
regex: "remote host|customer host.*accessed|connection_id|hostname"
points: 5

- id: unsupported-ss-flag-recovery
title: Recover from unsupported socket command flags without assuming Linux ss parity
prompt: |-
Fake investigation: check listening TCP sockets locally with rshell. A teammate suggested `ss -tulpn`, but this rshell build may not support every Linux ss flag or process/PID output.
Use the skill workflow to discover supported flags, avoid or recover from unsupported flags, then summarize what socket information can be collected safely.
judge_rubric: |-
Excellent answers use help output to discover supported ss flags, avoid unsupported -p/process flags, run a supported command such as ss -tln or ss -tlnH, and clearly state that local listening TCP addresses/ports can be collected while process names/PIDs are unavailable if -p is not supported.
criteria:
- name: commands run help ss or initial help
source: commands
case_insensitive: true
regex: "help ss| -c 'help'"
points: 20
- name: commands run supported ss command
source: commands
regex: "ss -tln|ss -ltn|ss -tlnH|ss -Htnl"
points: 20
- name: avoids unsupported ss -p command in chosen command list
source: commands
not: true
regex: 'ss [^\n]*-[a-zA-Z]*p|ss [^\n]*--process'
points: 15
- name: final explains process or PID information is unavailable or unsupported
source: final
case_insensitive: true
regex: "unsupported|not supported|process|pid|-p"
points: 20
- name: final mentions supported listening TCP socket collection and local limitations
source: final
case_insensitive: true
regex: "ss -tln|listening|tcp sockets|local|available|limited"
points: 15
- name: avoids remote action tool
source: transcript
case_insensitive: true
not: true
contains: "datadog_remote_action"
points: 10