coord run-log (full) — 2026-04-28 14:59 #50013
ellataira wants to merge 9 commits into ella/claude-coordinator-harness
## Summary
Fixes `--seed` CLI flag in all four eval invoke tasks:
`q.eval-bayesian`, `q.eval-combinations`, `q.eval-pipeline`,
`q.eval-component`.
## Why
Invoke passes CLI args as strings when the default is `None`, regardless
of the `int` annotation. So `--seed 42` becomes the string `"42"` inside
the function.
Downstream behavior differs by library:
| Task | Downstream | Symptom |
|---|---|---|
| `eval_bayesian` | `optuna.TPESampler()` → numpy `RandomState.seed()` | **Crashes**: `TypeError: Cannot cast scalar from dtype('<U2') to dtype('int64')` |
| `eval_combinations`, `eval_pipeline`, `eval_component` | Python stdlib `random.Random(seed)` | **Silent**: stdlib hashes the string, producing a different RNG sequence than programmatic `seed=42` |
The silent cases are arguably worse: someone reproduces a run with
`--seed 42`, gets different results than before, and doesn't know why.
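
The silent divergence can be reproduced with the standard library alone. A minimal sketch, where the helper name `first_draws` is illustrative and not part of the PR:

```python
import random


def first_draws(seed, n=3):
    """Return the first n floats from a stdlib RNG seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# What a task receives from the CLI vs. what a programmatic caller passes.
from_cli = first_draws("42")
programmatic = first_draws(42)

# Both calls succeed, but the sequences differ: a str seed is converted to
# an integer from its bytes rather than parsed as a decimal number.
print(from_cli == programmatic)

# Coercing first restores the programmatic sequence.
print(first_draws(int("42")) == programmatic)
```

No exception is raised in either case, which is why the three stdlib-based tasks never surfaced the problem.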
The docs at [Evals & fine
tuning](https://datadoghq.atlassian.net/wiki/spaces/agent/pages/6528369095/)
advertise `--seed 42` as a supported flag, so this is fixing documented
behavior that was never reliable from CLI.
## Change
Coerce `seed` to `int` at function entry in all 4 tasks:
```python
if seed is not None:
    seed = int(seed)
```
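
If the four tasks were to share the coercion, it could be factored out as below. This is only a sketch: `coerce_seed` is a hypothetical helper, and the PR itself inlines the two-line check in each task.

```python
from typing import Optional, Union


def coerce_seed(seed: Optional[Union[int, str]]) -> Optional[int]:
    """Normalize a CLI-provided seed: None stays None, anything else becomes int."""
    if seed is not None:
        # int() raises ValueError for non-numeric input such as "abc",
        # so a bad --seed fails at function entry instead of downstream.
        seed = int(seed)
    return seed


print(coerce_seed("42"), coerce_seed(42), coerce_seed(None))
```

One side effect of coercing at entry is that a malformed seed now fails loudly in every task, not only in the numpy-backed one.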
## Test plan
- [x] Verified `dda inv --dep=optuna q.eval-bayesian --only bocpd
--n-trials 50 --seed 42 --scenarios ...` succeeds (manual reproduction)
- [x] Passing no `--seed` still works (the `None` path is unchanged)
- [ ] `q.eval-combinations`, `q.eval-pipeline`, `q.eval-component` not
individually re-run, but the fix is a pure coercion added before
existing behavior
🤖 Generated with [Claude Code](https://claude.com/claude-code)
### What does this PR do?
This adds a new lookahead directory for custom gensim scenarios owned by the q branch, like this [one](DataDog/gensim-episodes#93).
### Motivation
### Describe how you validated your changes
### Additional Notes
## Summary
The testbench's `GetLogAnomalies()` only matched anomalies with `AnomalyTypeLog`, so bocpd/scanmw/scanwelch detections on `log_pattern_extractor` and `log_metrics_extractor` series never appeared in `/api/log-anomalies` or `logAnomalyCount`, even though the live observer's `notify.go` already treats them as log anomalies via `isLogDerivedAnomaly()`.

Apply the same `|| isLogDerivedAnomaly(a)` pattern the live path uses, keeping testbench and live observer classification logic in sync. No changes to detectors or `AnomalyType` assignments.
## Changes
- `testbench.go`: extend log anomaly filter to include `isLogDerivedAnomaly(a)` alongside `AnomalyTypeLog`
## Test plan
- [x] Builds clean (`go build ./comp/observer/...`)
- [x] Loaded 221_base_kubelet parquet in testbench: `logAnomalyCount` went from 0 to 21, and `/api/log-anomalies` returns bocpd detections on `log_pattern_extractor` series
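
The filter change can be pictured as follows. The real code is Go; this Python sketch only mirrors the predicate's shape, and the `Anomaly` fields, the constant values, and the source-name check inside `is_log_derived` are assumptions for illustration, not the observer's actual implementation.

```python
from dataclasses import dataclass
from typing import List

ANOMALY_TYPE_LOG = "log"  # assumed stand-in for Go's AnomalyTypeLog
# Assumed: log-derived series are recognized by the extractor that produced them.
LOG_DERIVED_SOURCES = {"log_pattern_extractor", "log_metrics_extractor"}


@dataclass
class Anomaly:
    type: str    # hypothetical field
    source: str  # hypothetical field: extractor that produced the series


def is_log_derived(a: Anomaly) -> bool:
    """Stand-in for isLogDerivedAnomaly(); the real check may differ."""
    return a.source in LOG_DERIVED_SOURCES


def log_anomalies(anomalies: List[Anomaly]) -> List[Anomaly]:
    """Old filter was `a.type == ANOMALY_TYPE_LOG`; the fix adds the second clause."""
    return [a for a in anomalies if a.type == ANOMALY_TYPE_LOG or is_log_derived(a)]


sample = [
    Anomaly("log", "raw_logs"),
    Anomaly("metric", "log_pattern_extractor"),  # e.g. a bocpd detection
    Anomaly("metric", "external_metrics"),
]
print(len(log_anomalies(sample)))
```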
### What does this PR do?
Small utility because I'm lost sometimes...
### Motivation
### Describe how you validated your changes
### Additional Notes
…ics (#49903)
### What does this PR do?
Adds `observer.ingest_metrics.enabled` (default `true`) to drop externally-ingested metrics at the handle factory while keeping logs and log-derived virtual metrics flowing. Wires the option into the gensim-eks runner via a `--disable-metrics-ingestion` flag on `inv aws.eks.gensim.submit`, setting both the env var and the Helm config field.
### Motivation
Enables log-only anomaly detection runs on gensim-eks without external metrics.
### Describe how you validated your changes
461 observer unit tests pass. New tests cover drop behavior, virtual metric passthrough, and template rendering for the new flag. Manually tested on local build.
### Additional Notes
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### What does this PR do?
This adds an option to disable metrics on the eks task. It also cleans up the most recent code for that task.
### Motivation
### Describe how you validated your changes
### Additional Notes
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
… without logs agent (#49481)
## Summary
- New component at `comp/observer/logssource/` that pipes container logs into the observer when `logs_enabled: false`, with no logs shipped to the Datadog backend
- Discovers containers automatically via workloadmeta; no customer annotations required
- Reuses the existing container launcher, processor, and auditor; adds a drain-only pipeline (no network sender)
## Design
- **Gated**: no-ops when observer/workloadmeta are unavailable. No new config flag.
- **Filtering**: skips pause containers (image name heuristic) and agent-own containers to avoid feedback loops
- **Container ID idempotency**: tracks active sources by container ID to prevent dual-tailer on repeated workloadmeta Set events
- **Ordered shutdown**: `cancel → sp.wait() → launchersMgr.Stop() → proc.Stop() → close(outputChan) → <-drainDone`, which prevents deadlock between launcher writes and processor drain
## Test plan
- [x] Unit tests (`dda inv test --targets=./comp/observer/logssource/...`)
  - Filter function coverage (pause, agent image detection)
  - `handleSet`/`handleUnset` behavior (running filter, idempotency, re-add after removal)
  - Goroutine lifecycle: `goleak` verifies clean exit after cancel+wait
- [x] Live EKS run: confirmed all relevant containers tailed on startup, logs delivering to observer (40k+ logs/run in recorded parquets)
- [x] Kubelet journald logs flow into observer
---------
Co-authored-by: Célian Raimbault <161456554+CelianR@users.noreply.github.com>
Co-authored-by: Celian Raimbault <celian.raimbault@datadoghq.com>
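
The container ID idempotency described above can be sketched in a few lines. The component itself is Go; this Python version is an illustration only, and `SourceTracker` and its method signatures are hypothetical names that loosely echo `handleSet`/`handleUnset` from the test plan.

```python
class SourceTracker:
    """Tracks which container IDs already have an active log source."""

    def __init__(self):
        self._active = {}

    def handle_set(self, container_id: str, running: bool = True) -> bool:
        """Return True only when a new source (tailer) should be started."""
        if not running:
            return False  # running filter: ignore non-running containers
        if container_id in self._active:
            return False  # repeated Set event: do not start a second tailer
        self._active[container_id] = object()  # placeholder for a real source
        return True

    def handle_unset(self, container_id: str) -> bool:
        """Return True when an active source was removed."""
        return self._active.pop(container_id, None) is not None


tracker = SourceTracker()
events = [
    tracker.handle_set("abc"),    # first Set starts a source
    tracker.handle_set("abc"),    # duplicate Set is ignored
    tracker.handle_unset("abc"),  # removal
    tracker.handle_set("abc"),    # re-add after removal starts a source again
]
print(events)
```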
…bserver-full-20260428T1459
|
iter 0 · Evaluating against detectors:
Budget: Run total: 3,301,173 tokens (~$50.73) (0.7% of 500,000,000 ceiling). Model mix: Opus 100%, Sonnet 0%. — Claude (coordinator harness) · |
Gitlab CI Configuration Changes
|
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 361 | 0 | 0 |
Updated: .gitlab/distribution.yml
Changes Summary
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 0 | 2 | 0 |
ℹ️ Diff available in the job log.
Go Package Import Differences
Baseline: e5b320d
|
|
iter 0 · Detail: This usually means workspace deps are broken ( Iteration aborted; no SDK tokens spent. Budget: Run total: 3,301,173 tokens (~$50.73) (0.7% of 500,000,000 ceiling). Model mix: Opus 100%, Sonnet 0%. — Claude (coordinator harness) · |
|
iter 1 · Evaluating against detectors:
Budget: Run total: 3,471,224 tokens (~$51.25) (0.7% of 500,000,000 ceiling). Model mix: Opus 95%, Sonnet 5%. — Claude (coordinator harness) · |
Static quality checks
❌ Please find below the results from static quality gates
Error
Gate failure full details
Static quality gates prevent the PR from merging!
Successful checks
Info
On-wire sizes (compressed)
|
|
iter 1 · Evaluating against detectors:
Budget: Run total: 3,621,773 tokens (~$51.70) (0.7% of 500,000,000 ceiling). Model mix: Opus 91%, Sonnet 9%. — Claude (coordinator harness) · |
|
❌ iter 1 · Implementer summary: Working tree reverted; moving on. The proposer will see this as a rejection reason on the next iter. Budget: This iter: 4,706,998 in / 21,957 out ( — Claude (coordinator harness) · |
|
iter 2 · Evaluating against detectors:
Budget: Run total: 8,263,744 tokens (~$122.65) (1.7% of 500,000,000 ceiling). Model mix: Opus 96%, Sonnet 4%. — Claude (coordinator harness) · |
|
❌ iter 2 · Implementer summary: Working tree reverted; moving on. The proposer will see this as a rejection reason on the next iter. Budget: This iter: 3,795,875 in / 16,101 out ( — Claude (coordinator harness) · |
|
iter 3 · Evaluating against detectors:
Budget: Run total: 12,075,720 tokens (~$180.79) (2.4% of 500,000,000 ceiling). Model mix: Opus 97%, Sonnet 3%. — Claude (coordinator harness) · |
|
❌ iter 3 · Implementer summary: Working tree reverted; moving on. The proposer will see this as a rejection reason on the next iter. Budget: This iter: 3,167,050 in / 17,072 out ( — Claude (coordinator harness) · |
|
iter 4 · Evaluating against detectors:
Budget: Run total: 15,259,842 tokens (~$229.58) (3.1% of 500,000,000 ceiling). Model mix: Opus 98%, Sonnet 2%. — Claude (coordinator harness) · |
|
❌ iter 4 · Mean F1 0.1160 → 0.1160 (Δ+0.0000). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 2,510,530 in / 47,630 out ( — Claude (coordinator harness) · |
|
iter 5 · Evaluating against detectors:
Budget: Run total: 17,818,002 tokens (~$250.60) (3.6% of 500,000,000 ceiling). Model mix: Opus 90%, Sonnet 10%. — Claude (coordinator harness) · |
|
🚨 iter 5 · Gate failures: strict_regressions=['scanmw/213_pagerduty'] recall_violations=['scanmw/093_cloudflare', 'scanmw/213_pagerduty', 'scanmw/546_cloudflare', 'scanmw/casino_postgresql', 'scanmw/ehr_pgbouncer', 'scanmw/food_delivery_redis'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.2684 vs baseline 0.1035 (Δ+0.1648). Total FPs 406 → 27 (Δ-379). Working tree reverted; no commit. Budget: This iter: 6,291,795 in / 63,163 out ( — Claude (coordinator harness) · |
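
Iter 5 shows why mean F1 alone does not decide a commit: the mean rose and false positives fell, yet the candidate was reverted because gates failed. A minimal sketch of that veto, assuming any non-empty gate list rejects the candidate regardless of the mean; `gate_verdict` and its return shape are hypothetical, not the harness's API.

```python
from typing import Dict, List, Tuple


def gate_verdict(
    strict_regressions: List[str],
    recall_violations: List[str],
) -> Tuple[str, Dict[str, List[str]]]:
    """Reject if either gate list is non-empty (assumed policy)."""
    failures = {}
    if strict_regressions:
        failures["strict_regressions"] = strict_regressions
    if recall_violations:
        failures["recall_violations"] = recall_violations
    return ("reject", failures) if failures else ("pass", failures)


# Values copied from the iter 5 and iter 8 status comments.
iter5 = gate_verdict(
    ["scanmw/213_pagerduty"],
    ["scanmw/093_cloudflare", "scanmw/213_pagerduty", "scanmw/546_cloudflare",
     "scanmw/casino_postgresql", "scanmw/ehr_pgbouncer", "scanmw/food_delivery_redis"],
)
iter8 = gate_verdict([], ["scanmw/546_cloudflare"])
print(iter5[0], iter8[0])
```

Under this assumed policy a single recall violation, as in iter 8, is enough to revert the working tree.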
|
iter 6 · Evaluating against detectors:
Budget: Run total: 24,172,960 tokens (~$270.43) (4.8% of 500,000,000 ceiling). Model mix: Opus 66%, Sonnet 34%. — Claude (coordinator harness) · |
|
❌ iter 6 · Mean F1 0.1259 → 0.1217 (Δ-0.0042). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 1,957,099 in / 19,793 out ( — Claude (coordinator harness) · |
|
iter 7 · Evaluating against detectors:
Budget: Run total: 28,173,996 tokens (~$318.42) (5.6% of 500,000,000 ceiling). Model mix: Opus 67%, Sonnet 33%. — Claude (coordinator harness) · |
|
iter 7 · Evaluating against detectors:
Budget: Run total: 30,597,968 tokens (~$326.01) (6.1% of 500,000,000 ceiling). Model mix: Opus 62%, Sonnet 38%. — Claude (coordinator harness) · |
|
🚨 iter 7 · Gate failures: strict_regressions=['bocpd/casino_postgresql'] recall_violations=['bocpd/353_postmark', 'bocpd/casino_postgresql', 'bocpd/ehr_pgbouncer'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.0889 vs baseline 0.1160 (Δ-0.0272). Total FPs 53 → 30 (Δ-23). Working tree reverted; no commit. Budget: This iter: 4,712,040 in / 42,081 out ( — Claude (coordinator harness) · |
|
iter 8 · Evaluating against detectors:
Budget: Run total: 32,928,117 tokens (~$333.19) (6.6% of 500,000,000 ceiling). Model mix: Opus 57%, Sonnet 43%. — Claude (coordinator harness) · |
|
🚨 iter 8 · Gate failures: strict_regressions=[] recall_violations=['scanmw/546_cloudflare'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.1213 vs baseline 0.1035 (Δ+0.0177). Total FPs 406 → 432 (Δ+26). Working tree reverted; no commit. Budget: This iter: 5,452,521 in / 154,398 out ( — Claude (coordinator harness) · |
|
iter 9 · Evaluating against detectors:
Budget: Run total: 38,535,036 tokens (~$351.86) (7.7% of 500,000,000 ceiling). Model mix: Opus 49%, Sonnet 51%. — Claude (coordinator harness) · |
|
❌ iter 9 · Mean F1 0.1259 → 0.1259 (Δ+0.0000). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 5,538,684 in / 97,716 out ( — Claude (coordinator harness) · |
|
💸 iter 9 cost: $44.76 (5,636,400 tokens). Rolling mean (last 5): $18.16. Triggers:
Streak: 1 consecutive anomalous iter(s) (auto-pause at 3). — Claude (coordinator harness) · |
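
The cost alert above combines a rolling mean with a consecutive-anomaly streak. The log does not show the actual trigger rules, so the sketch below rests on stated assumptions: an iteration is anomalous when its cost exceeds `ratio` times the mean of the previous `window` iterations, and the run pauses once the streak reaches `pause_after`. `CostMonitor` and the `ratio` default are hypothetical; only the window of 5 and the pause threshold of 3 come from the log.

```python
from collections import deque


class CostMonitor:
    def __init__(self, window: int = 5, ratio: float = 2.0, pause_after: int = 3):
        self.history = deque(maxlen=window)  # costs of the last `window` iters
        self.ratio = ratio                   # assumed anomaly multiplier
        self.pause_after = pause_after
        self.streak = 0

    def record(self, cost: float) -> bool:
        """Record one iteration's cost; return True when the run should pause."""
        if self.history:
            mean = sum(self.history) / len(self.history)
            anomalous = cost > self.ratio * mean
        else:
            anomalous = False  # nothing to compare against yet
        self.streak = self.streak + 1 if anomalous else 0
        self.history.append(cost)
        return self.streak >= self.pause_after


monitor = CostMonitor()
for cost in [18.0, 18.0, 18.0, 18.0, 18.0]:
    monitor.record(cost)
paused = monitor.record(44.76)  # one anomalous iteration, as in iter 9
print(monitor.streak, paused)
```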
|
iter 10 · Evaluating against detectors:
Budget: Run total: 46,549,035 tokens (~$433.98) (9.3% of 500,000,000 ceiling). Model mix: Opus 50%, Sonnet 50%. — Claude (coordinator harness) · |
|
🚨 iter 10 · Gate failures: strict_regressions=['scanwelch/063_twilio', 'scanwelch/211_doordash', 'scanwelch/213_pagerduty', 'scanwelch/221_base', 'scanwelch/food_delivery_redis'] recall_violations=['scanwelch/063_twilio', 'scanwelch/093_cloudflare', 'scanwelch/211_doordash', 'scanwelch/213_pagerduty', 'scanwelch/221_base', 'scanwelch/546_cloudflare', 'scanwelch/casino_postgresql', 'scanwelch/ehr_pgbouncer', 'scanwelch/food_delivery_redis'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.0225 vs baseline 0.1259 (Δ-0.1034). Total FPs 528 → 176 (Δ-352). Working tree reverted; no commit. Budget: This iter: 5,198,403 in / 86,611 out ( — Claude (coordinator harness) · |
|
🔔 Phase 1 plateaued after 5 consecutive non-improving iterations. Best score so far: 0.2684. Pivot #1. Banned (newly added): ['bocpd-hampel-robust-likelihood', 'conformal-martingale-postfilter', 'seasonal-bucket-detrending', 'signal-shape-router', 'tail-confirmation-postfilter'] Coordinator auto-pivoting: the proposer will generate new candidates with the banned families filtered out. (Write — Claude (coordinator harness) · |
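
The plateau rule can be read as a patience counter over per-iteration scores. A minimal sketch, assuming "non-improving" means the score did not strictly exceed the best seen so far; `plateaued` is a hypothetical helper and the early scores in the example are made up, with only the later values taken from the log.

```python
from typing import Iterable


def plateaued(scores: Iterable[float], patience: int = 5) -> bool:
    """True if the last `patience` scores all failed to beat the running best."""
    best = float("-inf")
    stale = 0
    for score in scores:
        if score > best:
            best = score
            stale = 0   # improvement resets the counter
        else:
            stale += 1  # non-improving iteration
    return stale >= patience


# Best of 0.2684 followed by five non-improving iterations (iters 6 to 10).
history = [0.1160, 0.2684, 0.1217, 0.0889, 0.1213, 0.1259, 0.0225]
print(plateaued(history))
```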
|
iter 11 · Evaluating against detectors:
Budget: Run total: 54,472,862 tokens (~$492.26) (10.9% of 500,000,000 ceiling). Model mix: Opus 48%, Sonnet 52%. — Claude (coordinator harness) · |
Coordinator harness run-log.
seeded filters
full · claude/observer-full-20260428T1459 · ella/claude-coordinator-harness (harness + observer; PR diff = ships only) · q-branch-observer

This PR is the bidirectional control channel: status comments posted by the coordinator; steering comments by the operator. Each shipped candidate becomes one commit on this branch, making the run a self-contained eval-matrix entry diffable against ella/claude-coordinator-harness.