5 changes: 5 additions & 0 deletions auto-improve-skills/.gitignore
@@ -0,0 +1,5 @@
runs/*
!runs/.gitkeep
tmp/*
!tmp/.gitkeep
benchmarks/remote-host-diagnostics/generated-fixtures/
141 changes: 141 additions & 0 deletions auto-improve-skills/README.md
@@ -0,0 +1,141 @@
# Auto-Improve Skills

Autoresearch-style tooling for automatically improving Agent Skills with fixed benchmarks, nested `pi` runs, and git-tracked accepted iterations.

The current target is:

```text
auto-improve-skills/skills/remote-host-diagnostics/SKILL.md
```

The loop is inspired by <https://github.com/karpathy/autoresearch>: keep the benchmark fixed, let an LLM edit one target file, measure the candidate, then keep or reject it.

## Layout

```text
program.md Instructions for researcher agents
skills/remote-host-diagnostics/SKILL.md Target skill being improved
benchmarks/remote-host-diagnostics/cases.yaml Benchmark cases and deterministic scoring criteria
benchmarks/remote-host-diagnostics/generated-fixtures/ Generated fake logs (gitignored; recreated deterministically)
cmd/skillbench/ Go benchmark runner
cmd/skillfixtures/ Deterministic fixture generator
cmd/skilltrain/ Go improvement-loop orchestrator
internal/autoresearch/ Shared Go types/helpers
runs/ Benchmark/training outputs, gitignored except .gitkeep
report/remote-host-diagnostics-autoresearch.html Single-file slide report
```

## Prerequisites

- Run from the rshell repository root.
- Ensure the local `./rshell` binary exists. The benchmark runner can build it if missing, but the explicit setup is:

```sh
make build
```

- `pi` must be installed and authenticated for `openai-codex/gpt-5.5`.
- The Go tools now auto-detect `pi` from `PATH`, `PI_BIN`, the npm global prefix, and common nvm locations (see the sketch after this list).
- If auto-detection fails, pass `-pi /absolute/path/to/pi` or set `PI_BIN=/absolute/path/to/pi`.
- Example nvm path on this machine: `/Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi`.
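
The lookup order can be approximated like this; a minimal sketch of the documented behavior, where `findPi` and the exact npm/nvm paths are illustrative rather than the tools' actual code:

```go
package pidetect

import (
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// findPi approximates the documented lookup order: PI_BIN, PATH, the
// npm global prefix, then common nvm install locations.
func findPi() (string, error) {
	if p := os.Getenv("PI_BIN"); p != "" {
		return p, nil
	}
	if p, err := exec.LookPath("pi"); err == nil {
		return p, nil
	}
	// npm global prefix: <prefix>/bin/pi.
	if out, err := exec.Command("npm", "prefix", "-g").Output(); err == nil {
		p := filepath.Join(strings.TrimSpace(string(out)), "bin", "pi")
		if _, err := os.Stat(p); err == nil {
			return p, nil
		}
	}
	// Common nvm layout: ~/.nvm/versions/node/<version>/bin/pi.
	if home, err := os.UserHomeDir(); err == nil {
		matches, _ := filepath.Glob(filepath.Join(home, ".nvm/versions/node/*/bin/pi"))
		if len(matches) > 0 {
			return matches[len(matches)-1], nil // lexically last version; a sketch, not semver-aware
		}
	}
	return "", exec.ErrNotFound
}
```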

## Run the benchmark

```sh
go run ./auto-improve-skills/cmd/skillbench \
-model openai-codex/gpt-5.5
```

Useful variants:

```sh
# Quick smoke test
go run ./auto-improve-skills/cmd/skillbench -limit 1

# One specific case
go run ./auto-improve-skills/cmd/skillbench -case datadog-agent-config-regression

# More semantic, more expensive scoring with LLM-as-judge
go run ./auto-improve-skills/cmd/skillbench -judge
```

The runner deterministically regenerates large fake log fixtures under `auto-improve-skills/benchmarks/remote-host-diagnostics/generated-fixtures/` before each run. The generated logs are gitignored.

The runner writes a JSON report and raw nested-`pi` JSONL transcripts under `auto-improve-skills/runs/`.
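
This README does not pin down the report schema, so a consumer can decode it generically; a minimal sketch in which the run path and any top-level keys are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Illustrative path: actual run directories are created by skillbench.
	raw, err := os.ReadFile("auto-improve-skills/runs/example/report.json")
	if err != nil {
		panic(err)
	}
	var report map[string]any
	if err := json.Unmarshal(raw, &report); err != nil {
		panic(err)
	}
	for k, v := range report { // dump top-level keys; the schema is tool-defined
		fmt.Printf("%s: %v\n", k, v)
	}
}
```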

If you see `exec: "pi": executable file not found in $PATH`, either update to this version of the tooling or pass an explicit binary:

```sh
go run ./auto-improve-skills/cmd/skillbench \
-pi /Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi
```

## Run the training loop

Commit or stash unrelated changes first, then run:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-model openai-codex/gpt-5.5 \
-iters 3 \
-judge
```

The loop (a Go sketch follows the list):

1. Runs a baseline benchmark.
2. Invokes `pi` as a researcher to edit only `SKILL.md`.
3. Runs the benchmark again.
4. Commits the skill edit if the normalized score improves by at least `-min-delta`.
5. Reverts the skill edit if it does not improve.
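
The accept/reject rule reduces to one comparison against the running best score. A minimal Go sketch, with illustrative stubs standing in for the benchmark and researcher steps:

```go
package trainloop

import "os/exec"

const skillPath = "auto-improve-skills/skills/remote-host-diagnostics/SKILL.md"

// Illustrative stubs: runBenchmark would invoke skillbench and return a
// normalized score; runResearcher would invoke pi to edit SKILL.md.
func runBenchmark() (float64, error) { return 0, nil }
func runResearcher() error           { return nil }

// Train mirrors the five numbered steps above.
func Train(iters int, minDelta float64) error {
	best, err := runBenchmark() // 1. baseline benchmark
	if err != nil {
		return err
	}
	for i := 0; i < iters; i++ {
		if err := runResearcher(); err != nil { // 2. researcher edits only SKILL.md
			return err
		}
		score, err := runBenchmark() // 3. measure the candidate
		if err != nil {
			return err
		}
		if score >= best+minDelta { // 4. improved by at least -min-delta: commit
			if err := exec.Command("git", "commit", "-m", "skilltrain: accept iteration", skillPath).Run(); err != nil {
				return err
			}
			best = score
		} else if err := exec.Command("git", "checkout", "--", skillPath).Run(); err != nil { // 5. otherwise revert
			return err
		}
	}
	return nil
}
```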

If `pi` is outside your shell `PATH`, use the same `-pi` flag:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-pi /Users/alexandre.yang/.nvm/versions/node/v22.18.0/bin/pi \
-model openai-codex/gpt-5.5 \
-iters 3 \
-judge
```

For a safe proof run that exercises the loop without committing:

```sh
go run ./auto-improve-skills/cmd/skilltrain \
-iters 1 \
-limit 1 \
-dry-run \
-allow-dirty \
-run-dir auto-improve-skills/runs/train-proof
```

## Fixture generation

Generate or refresh the deterministic fixtures without running nested agents:

```sh
go run ./auto-improve-skills/cmd/skillfixtures
```

The generated files are intentionally not committed. Each log file contains 500 to 2,000 lines, with rotations, red herrings, cross-service correlations, and container/host-mounted log layouts.
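
Determinism comes from fixed seeding: the same seed reproduces byte-identical fixtures on every run, which is what makes it safe to keep them out of git. A minimal sketch of the idea, with illustrative file names and log lines rather than skillfixtures' actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
)

func main() {
	rng := rand.New(rand.NewSource(42)) // fixed seed: every run reproduces the same bytes
	if err := os.MkdirAll("generated-fixtures", 0o755); err != nil {
		panic(err)
	}
	f, err := os.Create("generated-fixtures/agent.log") // illustrative file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	levels := []string{"INFO", "WARN", "ERROR"}
	n := 500 + rng.Intn(1501) // 500 to 2,000 lines, matching the description above
	for i := 0; i < n; i++ {
		fmt.Fprintf(f, "2024-05-01 10:%02d:%02d UTC | CORE | %s | fake log line %d\n",
			rng.Intn(60), rng.Intn(60), levels[rng.Intn(3)], i)
	}
}
```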

## Current benchmark suite

The suite measures final-answer quality across realistic fake investigations:

- Datadog Agent config regression hidden among integration/APM/intake noise
- SSH brute-force summary with approximate counting and no-compromise distinction
- Checkout HTTP 500/502 root-cause correlation to PostgreSQL pool/slot exhaustion
- Containerized Agent host-log fallback with x509 failures caused by clock skew
- Unsupported `ss` flag recovery

More cases can be added to `benchmarks/remote-host-diagnostics/cases.yaml` without changing Go code.
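
For orientation, each criterion pairs a `source` (`final`, `commands`, or `transcript`) with a `contains` or `regex` check, an optional `not` inversion, and a `points` value. A minimal Go sketch of that deterministic scoring rule; the `Criterion` type here is an assumption, not the real `internal/autoresearch` definition:

```go
package main

import (
	"fmt"
	"regexp"
)

// Criterion mirrors the fields used in cases.yaml.
type Criterion struct {
	Name            string
	Source          string // "final", "commands", or "transcript"
	Contains        string // literal substring check
	Regex           string // regexp check (takes precedence here if set)
	CaseInsensitive bool
	Not             bool // invert: award points when the pattern is absent
	Points          int
}

// Score awards a criterion's points if its check passes against the
// text drawn from the named source.
func Score(c Criterion, sources map[string]string) int {
	pattern := c.Regex
	if pattern == "" {
		pattern = regexp.QuoteMeta(c.Contains)
	}
	if c.CaseInsensitive {
		pattern = "(?i)" + pattern
	}
	matched := regexp.MustCompile(pattern).MatchString(sources[c.Source])
	if c.Not {
		matched = !matched
	}
	if matched {
		return c.Points
	}
	return 0
}

func main() {
	c := Criterion{Name: "final cites agent.log", Source: "final",
		Contains: "agent.log", CaseInsensitive: true, Points: 10}
	final := map[string]string{"final": "Root cause: YAML parse error at agent.log line 42."}
	fmt.Println(Score(c, final)) // prints 10
}
```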

## Report

Open the slide report in a browser:

```text
auto-improve-skills/report/remote-host-diagnostics-autoresearch.html
```
247 changes: 247 additions & 0 deletions auto-improve-skills/benchmarks/remote-host-diagnostics/cases.yaml
@@ -0,0 +1,247 @@
name: remote-host-diagnostics-quality
description: >-
Final-answer quality benchmarks for the local ./rshell remote-host-diagnostics skill.
Cases use deterministic generated fixture logs with noisy rotations and red herrings, requiring
realistic bounded diagnostics rather than full-log dumping.
skill_path: ../../skills/remote-host-diagnostics
cases:
- id: datadog-agent-config-regression
title: Distinguish a Datadog Agent config regression from noisy integration failures
prompt: |-
Fake investigation: a customer says host metrics stopped shortly after 10:12 UTC, but traces and some log intake still look healthy.
Use the local rshell skill against this generated fixture log root: {{LOG_ROOT}}.
Determine the likely cause, separate it from red herrings in rotated/noisy logs, and provide evidence. Keep the investigation read-only and bounded.
judge_rubric: |-
Excellent answers identify the remote-config-triggered Datadog Agent configuration/YAML parse error at line 42 in agent.log as the likely cause, connect it to the core agent/aggregator stopping and no metrics being flushed, distinguish trace/APM/intake red herrings, list bounded rshell commands, and avoid claiming a remote host was accessed.
criteria:
- name: final identifies invalid config or YAML parse failure at line 42
source: final
case_insensitive: true
regex: "(yaml|config).*line=42|line 42.*(yaml|config)|invalid configuration|config validation failed"
points: 25
- name: final ties the regression to remote config reload rc-8831
source: final
case_insensitive: true
regex: "remote[- ]config|rc-8831|config reload"
points: 15
- name: final connects failure to stopped metrics or core agent
source: final
case_insensitive: true
regex: "stopped|no metrics|metrics.*stopped|core agent|aggregator"
points: 15
- name: final distinguishes trace/APM/intake noise from root cause
source: final
case_insensitive: true
regex: "trace|apm|intake|red herring|not.*cause|unrelated"
points: 10
- name: final cites evidence from agent.log
source: final
case_insensitive: true
contains: "agent.log"
points: 10
- name: commands use the provided generated fixture log root as allowed path
source: commands
contains: "--allowed-paths {{LOG_ROOT}}"
points: 10
- name: commands run initial help
source: commands
contains: "./rshell --allow-all-commands --timeout 5s -c 'help'"
points: 5
- name: commands use bounded filters over current or rotated agent logs
source: commands
case_insensitive: true
regex: '(grep|tail|head|wc|find).*datadog.*/agent\.log|datadog.*/agent\.log.*(grep|tail|head|wc)|grep.*(rc-8831|line=42|no metrics|core agent)'
points: 10

- id: auth-bruteforce-summary
title: Quantify SSH brute-force activity amid normal bastion log noise
prompt: |-
Fake investigation: security asks whether there is evidence of SSH brute-force activity in the generated bastion logs.
Use the local rshell skill against fixture log root {{LOG_ROOT}}.
Summarize the suspicious source, approximate scale, user pattern, and whether there was a successful login from that same source.
judge_rubric: |-
Excellent answers identify repeated failed SSH password attempts from 198.51.100.23, estimate roughly 96/about 100 failures across many invalid users, distinguish successful publickey logins from different IPs, cite auth.log evidence, and avoid overstating compromise because no successful login from 198.51.100.23 is present.
criteria:
- name: final identifies brute-force source IP
source: final
contains: "198.51.100.23"
points: 20
- name: final describes failed-password brute-force pattern
source: final
case_insensitive: true
regex: "failed password|failed login|brute|invalid user"
points: 15
- name: final includes approximate count near 96 failures
source: final
case_insensitive: true
regex: '\b96\b|\b9[0-9]\b|about 100|roughly 100|~100|hundred'
points: 15
- name: final says there was no successful login from the suspicious source
source: final
case_insensitive: true
regex: 'no successful|no accepted|not successful|no evidence.*success|no login.*198\.51\.100\.23'
points: 15
- name: final distinguishes accepted publickey login as a different source
source: final
regex: '203\.0\.113\.8|198\.51\.100\.77|different IP|different source'
points: 10
- name: final cites auth.log
source: final
case_insensitive: true
contains: "auth.log"
points: 10
- name: commands use grep/wc/sort/uniq or similarly bounded filters
source: commands
case_insensitive: true
regex: 'grep.*(Failed password|198\.51\.100\.23)|wc -l|sort|uniq'
points: 10
- name: final avoids claiming account compromise from fixture evidence
source: final
case_insensitive: true
not: true
regex: 'compromised|successful.*198\.51\.100\.23'
points: 5

- id: checkout-500-root-cause
title: Correlate checkout HTTP 500/502s to database pool exhaustion
prompt: |-
Fake investigation: checkout users are seeing bursts of HTTP 500/502 errors around 10:10 UTC.
Use the local rshell skill against fixture log root {{LOG_ROOT}}.
Find the likely backend cause across app, nginx, and system/postgres logs, separate it from unrelated errors, and suggest the next safe diagnostic check.
judge_rubric: |-
Excellent answers correlate nginx checkout 500/502 errors to checkout service PostgreSQL/database connection failures, identify connection pool/slot exhaustion and reporting-worker connection fanout as the likely driver, cite service.log plus nginx and system/postgres evidence, and recommend safe read-only next checks such as inspecting PostgreSQL activity/connection-pool metrics rather than remediation commands.
criteria:
- name: final mentions checkout HTTP 500 or 502 symptom
source: final
case_insensitive: true
regex: "500|502|checkout"
points: 10
- name: final identifies database/postgres connection slot or pool exhaustion
source: final
case_insensitive: true
regex: "database|postgres|connection refused|connection slots|too many clients|pool exhausted|db pool"
points: 20
- name: final identifies reporting-worker or connection fanout as likely driver
source: final
case_insensitive: true
regex: "reporting-worker|connection fanout|fanout|reports"
points: 15
- name: final cites service log evidence
source: final
case_insensitive: true
regex: 'service\.log|checkout'
points: 10
- name: final cites nginx access or error evidence
source: final
case_insensitive: true
regex: 'nginx|access\.log|error\.log'
points: 10
- name: final cites system/postgres evidence
source: final
case_insensitive: true
regex: 'system\.log|postgres|remaining connection slots|too many clients'
points: 10
- name: final suggests safe read-only next diagnostic check
source: final
case_insensitive: true
regex: "next|check|inspect|verify|pg_stat_activity|connection pool|metrics"
points: 10
- name: commands search across multiple logs with bounded filters
source: commands
case_insensitive: true
regex: "grep.*(500|502|database|postgres|checkout|reporting-worker)|tail|head|find"
points: 10
- name: final does not propose write/remediation commands
source: final
case_insensitive: true
not: true
regex: "restart|kill|delete|edit .*config|apply"
points: 5

- id: container-host-log-fallback
title: Use /host-style fallback and identify certificate failures caused by clock skew
prompt: |-
Fake investigation: this simulates a containerized Agent layout. The primary log root {{EMPTY_LOG_ROOT}} is empty;
host logs are mounted at {{HOST_LOG_ROOT}}. Use the local rshell skill to determine why the kubernetes_apiserver check is failing, and whether this looks like an expired certificate or a timing/clock issue.
judge_rubric: |-
Excellent answers first handle the empty primary log directory, then inspect the host-mounted log root, identify x509 "not yet valid" kubernetes_apiserver failures caused by host/container clock skew and chrony correction, cite both Datadog agent and syslog/chronyd evidence, and explain this as a containerized host-log fallback case.
criteria:
- name: final identifies x509 not-yet-valid certificate problem
source: final
case_insensitive: true
regex: "x509|not yet valid|certificate.*not"
points: 20
- name: final identifies clock skew or time synchronization as root cause
source: final
case_insensitive: true
regex: "clock|skew|chrony|chronyd|time sync|system clock|notbefore"
points: 20
- name: final names kubernetes_apiserver check
source: final
case_insensitive: true
contains: "kubernetes_apiserver"
points: 15
- name: final mentions host-mounted fallback or empty primary logs
source: final
case_insensitive: true
regex: "host|fallback|empty|mounted"
points: 10
- name: final cites datadog agent.log evidence
source: final
case_insensitive: true
regex: 'agent\.log|datadog'
points: 10
- name: final cites syslog or chronyd evidence
source: final
case_insensitive: true
regex: 'syslog|chronyd|chrony|clocksource'
points: 10
- name: commands inspect both empty and host log roots
source: commands
regex: '{{EMPTY_LOG_ROOT}}[\s\S]*{{HOST_LOG_ROOT}}|{{HOST_LOG_ROOT}}[\s\S]*{{EMPTY_LOG_ROOT}}'
points: 10
- name: avoids saying real remote host was contacted
source: final
case_insensitive: true
not: true
regex: "remote host|customer host.*accessed|connection_id|hostname"
points: 5

- id: unsupported-ss-flag-recovery
title: Recover from unsupported socket command flags without assuming Linux ss parity
prompt: |-
Fake investigation: check listening TCP sockets locally with rshell. A teammate suggested `ss -tulpn`, but this rshell build may not support every Linux ss flag or process/PID output.
Use the skill workflow to discover supported flags, avoid or recover from unsupported flags, then summarize what socket information can be collected safely.
judge_rubric: |-
Excellent answers use help output to discover supported ss flags, avoid unsupported -p/process flags, run a supported command such as ss -tln or ss -tlnH, and clearly state that local listening TCP addresses/ports can be collected while process names/PIDs are unavailable if -p is not supported.
criteria:
- name: commands run help ss or initial help
source: commands
case_insensitive: true
regex: "help ss| -c 'help'"
points: 20
- name: commands run supported ss command
source: commands
regex: "ss -tln|ss -ltn|ss -tlnH|ss -Htnl"
points: 20
- name: avoids unsupported ss -p command in chosen command list
source: commands
not: true
regex: 'ss [^\n]*-[a-zA-Z]*p|ss [^\n]*--process'
points: 15
- name: final explains process or PID information is unavailable or unsupported
source: final
case_insensitive: true
regex: "unsupported|not supported|process|pid|-p"
points: 20
- name: final mentions supported listening TCP socket collection and local limitations
source: final
case_insensitive: true
regex: "ss -tln|listening|tcp sockets|local|available|limited"
points: 15
- name: avoids remote action tool
source: transcript
case_insensitive: true
not: true
contains: "datadog_remote_action"
points: 10