coord run-log (full) — 2026-04-27 16:44#49939
ellataira wants to merge 12 commits into ella/claude-coordinator-harness from claude/observer-full-20260427T1644
Conversation
## Summary
Fixes `--seed` CLI flag in all four eval invoke tasks:
`q.eval-bayesian`, `q.eval-combinations`, `q.eval-pipeline`,
`q.eval-component`.
## Why
Invoke passes CLI args as strings when the default is `None`, regardless
of the `int` annotation. So `--seed 42` becomes the string `"42"` inside
the function.
Downstream behavior differs by library:
| Task | Downstream | Symptom |
|---|---|---|
| `eval_bayesian` | `optuna.TPESampler()` → numpy `RandomState.seed()` | **Crashes** — `TypeError: Cannot cast scalar from dtype('<U2') to dtype('int64')` |
| `eval_combinations`, `eval_pipeline`, `eval_component` | Python stdlib `random.Random(seed)` | **Silent** — stdlib hashes the string, producing a different RNG sequence than programmatic `seed=42` |
The silent cases are arguably worse: someone reproduces a run with
`--seed 42`, gets different results than before, and doesn't know why.
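The silent divergence can be reproduced with the stdlib alone (no Invoke needed) — this is a minimal sketch of the failure mode, not code from the PR:

```python
import random

# Simulates what happens when Invoke hands the task a string seed:
# random.Random accepts a str, but converts it to an int by hashing
# its bytes rather than calling int(), so "42" and 42 seed different
# RNG streams.
programmatic = random.Random(42).random()
from_cli = random.Random("42").random()
assert from_cli != programmatic  # silently non-reproducible

# With the coercion applied at function entry, the streams match again.
coerced = random.Random(int("42")).random()
assert coerced == programmatic
```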
The docs at [Evals & fine
tuning](https://datadoghq.atlassian.net/wiki/spaces/agent/pages/6528369095/)
advertise `--seed 42` as a supported flag, so this fixes documented
behavior that was never reliable from the CLI.
## Change
Coerce `seed` to `int` at function entry in all 4 tasks:
```python
if seed is not None:
seed = int(seed)
```
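If the same guard ends up in more tasks, it could be factored into a small shared helper — `coerce_seed` is a hypothetical name, not part of this PR:

```python
def coerce_seed(seed):
    """Normalize a seed that may arrive as a CLI string.

    Invoke passes flag values as strings when the parameter default is
    None, so accept None, an int, or a numeric string, and let a
    non-numeric string fail loudly (ValueError) instead of reaching
    the RNG.
    """
    if seed is None:
        return None
    return int(seed)
```

Calling `seed = coerce_seed(seed)` at function entry keeps the `None` path untouched while turning a bad string into an immediate, obvious error.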
## Test plan
- [x] Verified `dda inv --dep=optuna q.eval-bayesian --only bocpd
--n-trials 50 --seed 42 --scenarios ...` succeeds (manual reproduction)
- [x] Passing no `--seed` still works (the `None` path is unchanged)
- [ ] `q.eval-combinations`, `q.eval-pipeline`, `q.eval-component` not
individually re-run, but the fix is a pure coercion added before
existing behavior
🤖 Generated with [Claude Code](https://claude.com/claude-code)
### What does this PR do?
This adds a new lookahead directory for custom gensim scenarios owned by the q branch, like this [one](DataDog/gensim-episodes#93).
### Motivation
### Describe how you validated your changes
### Additional Notes
## Summary
The testbench's `GetLogAnomalies()` only matched anomalies with `AnomalyTypeLog`, so bocpd/scanmw/scanwelch detections on `log_pattern_extractor` and `log_metrics_extractor` series never appeared in `/api/log-anomalies` or `logAnomalyCount` — even though the live observer's `notify.go` already treats them as log anomalies via `isLogDerivedAnomaly()`. Apply the same `|| isLogDerivedAnomaly(a)` pattern the live path uses, keeping testbench and live observer classification logic in sync. No changes to detectors or `AnomalyType` assignments.
## Changes
- `testbench.go`: extend the log anomaly filter to include `isLogDerivedAnomaly(a)` alongside `AnomalyTypeLog`
## Test plan
- [x] Builds clean (`go build ./comp/observer/...`)
- [x] Loaded 221_base_kubelet parquet in testbench — `logAnomalyCount` went from 0 to 21, and `/api/log-anomalies` returns bocpd detections on `log_pattern_extractor` series
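The fix is a one-line filter extension in Go; the classification logic it aligns can be sketched as follows. This is a Python illustration with hypothetical stand-ins (`LOG_DERIVED_SOURCES`, the `Anomaly` shape) for the real Go types, not the actual testbench code:

```python
from dataclasses import dataclass

# Assumption: log-derived anomalies are identified by the virtual series
# that produced them, mirroring what isLogDerivedAnomaly() checks.
LOG_DERIVED_SOURCES = {"log_pattern_extractor", "log_metrics_extractor"}

@dataclass
class Anomaly:
    anomaly_type: str  # e.g. "log", "metric"
    source: str        # series the detection fired on

def is_log_derived(a: Anomaly) -> bool:
    # Anomalies raised on log-extractor virtual series count as log
    # anomalies even when the detector tagged them with a metric type.
    return a.source in LOG_DERIVED_SOURCES

def get_log_anomalies(anomalies):
    # Old testbench filter: anomaly_type == "log" only.
    # Fixed filter: also admit log-derived anomalies, matching the live path.
    return [a for a in anomalies if a.anomaly_type == "log" or is_log_derived(a)]
```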
### What does this PR do?
Small utility because I'm lost sometimes...
### Motivation
### Describe how you validated your changes
### Additional Notes
…ics (#49903)
### What does this PR do?
Adds `observer.ingest_metrics.enabled` (default `true`) to drop externally-ingested metrics at the handle factory while keeping logs and log-derived virtual metrics flowing. Wires the option into the gensim-eks runner via a `--disable-metrics-ingestion` flag on `inv aws.eks.gensim.submit`, setting both the env var and the Helm config field.
### Motivation
Enables log-only anomaly detection runs on gensim-eks without external metrics.
### Describe how you validated your changes
461 observer unit tests pass. New tests cover drop behavior, virtual metric passthrough, and template rendering for the new flag. Manually tested on local build.
### Additional Notes
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### What does this PR do?
This adds an option to disable metrics on the eks task. It also cleans up the latest code for that.
### Motivation
### Describe how you validated your changes
### Additional Notes
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…bserver-full-20260427T1644
iter 0 · Evaluating against detectors:
Budget: Run total: 4,867,379 tokens (~$72.37) (1.0% of 500,000,000 ceiling). Model mix: Opus 97%, Sonnet 3%. — Claude (coordinator harness)

Steering: propose a brand new correlator, not a variant of the existing time_cluster correlator (comp/observer/impl/anomaly_correlator_time_cluster.go). Don't extend, wrap, or parameterize it — design a structurally different correlator that combines detector outputs via a different mechanism. Research relevant recent literature. Keep in mind performance. A more complex correlator is not always more effective. TimeCluster is effective AND simple, but we want to find something novel and better. Must implement the existing correlator interface cleanly (see time_cluster.go for the contract, not the algorithm) and register via component_catalog.go. Target this correlator as the single target_component, eval against all three detectors, perf ~1.5× time_cluster.

iter 0 · Evaluating against detectors:
Budget: Run total: 6,670,059 tokens (~$77.79) (1.3% of 500,000,000 ceiling). Model mix: Opus 70%, Sonnet 30%. — Claude (coordinator harness)
iter 0 · Evaluating against detectors:
Budget: Run total: 7,125,049 tokens (~$79.16) (1.4% of 500,000,000 ceiling). Model mix: Opus 66%, Sonnet 34%. — Claude (coordinator harness)

iter 0 · Evaluating against detectors:
Budget: Run total: 7,216,794 tokens (~$79.44) (1.4% of 500,000,000 ceiling). Model mix: Opus 65%, Sonnet 35%. — Claude (coordinator harness)

iter 0 · Evaluating against detectors:
Budget: Run total: 7,275,378 tokens (~$79.61) (1.5% of 500,000,000 ceiling). Model mix: Opus 65%, Sonnet 35%. — Claude (coordinator harness)

iter 0 · Evaluating against detectors:
Budget: Run total: 13,004,979 tokens (~$123.84) (2.6% of 500,000,000 ceiling). Model mix: Opus 52%, Sonnet 48%. — Claude (coordinator harness)

iter 0 · Detail: This usually means workspace deps are broken. Iteration aborted; no SDK tokens spent. Budget: Run total: 13,004,979 tokens (~$123.84) (2.6% of 500,000,000 ceiling). Model mix: Opus 52%, Sonnet 48%. — Claude (coordinator harness)

… into claude/observer-full-20260427T1644

Protected path 'comp/observer/scenarios' changed between iterations (recorded 8aa2ddbe6f38 vs current a82fc54e8b96). This is the scoring label set; any change invalidates mid-run F1 measurements. Halting. If the change is a legitimate upstream merge, delete db.protected_path_hashes and restart to re-baseline. — Claude (coordinator harness)

… into claude/observer-full-20260427T1644

iter 1 · Evaluating against detectors:
Budget: Run total: 13,004,979 tokens (~$123.84) (2.6% of 500,000,000 ceiling). Model mix: Opus 52%, Sonnet 48%. — Claude (coordinator harness)
🚨 iter 1 · Gate failures: strict_regressions=[] recall_violations=[] first_ship_floor_breached=mean_f1 0.2210 < 0.25 (no prior ship — quality bar for first commit on a blank-baseline run) Top 5 |ΔF1| scenarios:
Observed mean F1 0.2210 vs baseline 0.1160 (Δ+0.1050). Total FPs 53 → 49 (Δ-4). Working tree reverted; no commit. Budget: This iter: 2,166,939 in / 36,209 out — Claude (coordinator harness)

iter 2 · Evaluating against detectors:
Budget: Run total: 15,208,127 tokens (~$130.88) (3.0% of 500,000,000 ceiling). Model mix: Opus 44%, Sonnet 56%. — Claude (coordinator harness)

… into claude/observer-full-20260427T1644

iter 2 · Evaluating against detectors:
Budget: Run total: 15,392,940 tokens (~$131.44) (3.1% of 500,000,000 ceiling). Model mix: Opus 44%, Sonnet 56%. — Claude (coordinator harness)

🚨 iter 2 · Gate failures: strict_regressions=[] recall_violations=[] first_ship_floor_breached=mean_f1 0.1037 < 0.25 (no prior ship — quality bar for first commit on a blank-baseline run) Top 5 |ΔF1| scenarios:
Observed mean F1 0.1037 vs baseline 0.1035 (Δ+0.0001). Total FPs 406 → 401 (Δ-5). Working tree reverted; no commit. Budget: This iter: 1,084,560 in / 4,870 out — Claude (coordinator harness)

iter 3 · Evaluating against detectors:
Budget: Run total: 16,297,557 tokens (~$134.21) (3.3% of 500,000,000 ceiling). Model mix: Opus 41%, Sonnet 59%. — Claude (coordinator harness)

🚨 iter 3 · Gate failures: strict_regressions=[] recall_violations=[] first_ship_floor_breached=mean_f1 0.1231 < 0.25 (no prior ship — quality bar for first commit on a blank-baseline run) Top 5 |ΔF1| scenarios:
Observed mean F1 0.1231 vs baseline 0.1259 (Δ-0.0027). Total FPs 528 → 525 (Δ-3). Working tree reverted; no commit. Budget: This iter: 1,133,038 in / 5,001 out — Claude (coordinator harness)

❌ iter 10 · Mean F1 0.1160 → 0.1160 (Δ+0.0000). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 4,974,675 in / 43,696 out — Claude (coordinator harness)

💸 iter 10 cost: $32.10 (5,018,371 tokens). Rolling mean (last 5): $15.13. Triggers:
Streak: 2 consecutive anomalous iter(s) (auto-pause at 3). — Claude (coordinator harness)
iter 11 · Evaluating against detectors:
Budget: Run total: 51,581,377 tokens (~$368.91) (10.3% of 500,000,000 ceiling). Model mix: Opus 32%, Sonnet 68%. — Claude (coordinator harness)

🚨 iter 11 · Gate failures: strict_regressions=['scanwelch/213_pagerduty'] recall_violations=['scanwelch/093_cloudflare', 'scanwelch/211_doordash', 'scanwelch/213_pagerduty', 'scanwelch/546_cloudflare', 'scanwelch/ehr_pgbouncer'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.2260 vs baseline 0.1259 (Δ+0.1002). Total FPs 528 → 84 (Δ-444). Working tree reverted; no commit. Budget: This iter: 1,618,284 in / 16,489 out — Claude (coordinator harness)

🔔 Phase 1 plateaued after 5 consecutive non-improving iterations. Best score so far: 0.2210. Pivot #2. Banned (newly added): ['density-ratio-postfilter', 'hac-autocorrelation-correction', 'linear-detrend-postfilter', 'permutation-bootstrap-postfilter', 'spectral-residual-detector'] Coordinator auto-pivoting: the proposer will generate new candidates with the banned families filtered out. — Claude (coordinator harness)

iter 12 · Evaluating against detectors:
Budget: Run total: 55,484,331 tokens (~$409.69) (11.1% of 500,000,000 ceiling). Model mix: Opus 34%, Sonnet 66%. — Claude (coordinator harness)

❌ iter 12 · Mean F1 0.1160 → 0.1246 (Δ+0.0085). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 7,323,344 in / 218,984 out — Claude (coordinator harness)

💸 iter 12 cost: $50.14 (7,542,328 tokens). Rolling mean (last 5): $17.77. Triggers:
Streak: 1 consecutive anomalous iter(s) (auto-pause at 3). — Claude (coordinator harness)

iter 13 · Evaluating against detectors:
Budget: Run total: 63,026,659 tokens (~$459.83) (12.6% of 500,000,000 ceiling). Model mix: Opus 33%, Sonnet 67%. — Claude (coordinator harness)

🚨 iter 13 · Gate failures: strict_regressions=['scanmw/063_twilio', 'scanmw/213_pagerduty'] recall_violations=[] fp_ceiling_breached=624 > 609 (ratio 1.5× baseline 406) Top 5 |ΔF1| scenarios:
Observed mean F1 0.0580 vs baseline 0.1035 (Δ-0.0455). Total FPs 406 → 624 (Δ+218). Working tree reverted; no commit. Budget: This iter: 3,695,748 in / 80,156 out — Claude (coordinator harness)
iter 14 · Evaluating against detectors:
Budget: Run total: 66,802,563 tokens (~$472.12) (13.4% of 500,000,000 ceiling). Model mix: Opus 31%, Sonnet 69%. — Claude (coordinator harness)

🚨 iter 14 · Gate failures: strict_regressions=['scanwelch/063_twilio', 'scanwelch/211_doordash', 'scanwelch/213_pagerduty', 'scanwelch/221_base', 'scanwelch/703_shopify', 'scanwelch/food_delivery_redis'] recall_violations=['scanwelch/093_cloudflare', 'scanwelch/213_pagerduty'] fp_ceiling_breached=800 > 792 (ratio 1.5× baseline 528) Top 5 |ΔF1| scenarios:
Observed mean F1 0.0236 vs baseline 0.1259 (Δ-0.1022). Total FPs 528 → 800 (Δ+272). Working tree reverted; no commit. Budget: This iter: 2,112,591 in / 158,020 out — Claude (coordinator harness)

iter 15 · Evaluating against detectors:
Budget: Run total: 71,124,490 tokens (~$513.54) (14.2% of 500,000,000 ceiling). Model mix: Opus 32%, Sonnet 68%. — Claude (coordinator harness)

🚨 iter 15 · Gate failures: strict_regressions=['bocpd/059_fortnite', 'bocpd/703_shopify'] recall_violations=['bocpd/059_fortnite'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.1643 vs baseline 0.1160 (Δ+0.0483). Total FPs 53 → 58 (Δ+5). Working tree reverted; no commit. Budget: This iter: 2,295,867 in / 66,075 out — Claude (coordinator harness)

iter 16 · Evaluating against detectors:
Budget: Run total: 73,486,432 tokens (~$521.42) (14.7% of 500,000,000 ceiling). Model mix: Opus 31%, Sonnet 69%. — Claude (coordinator harness)

❌ iter 16 · Mean F1 0.1035 → 0.1035 (Δ+0.0000). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 4,961,380 in / 32,523 out — Claude (coordinator harness)

💸 iter 16 cost: $53.34 (4,993,903 tokens). Rolling mean (last 5): $16.82. Triggers:
Streak: 1 consecutive anomalous iter(s) (auto-pause at 3). — Claude (coordinator harness)

🔔 Phase 1 plateaued after 5 consecutive non-improving iterations. Best score so far: 0.2210. Pivot #3. Banned (newly added): ['cross-series-common-mode-rejection', 'evt-spot-self-calibrating-threshold', 'kernel-mmd-changepoint', 'matrix-profile-discord', 'seasonal-adaptive-hazard-bocpd'] Coordinator auto-pivoting: the proposer will generate new candidates with the banned families filtered out. — Claude (coordinator harness)
iter 17 · Evaluating against detectors:
Budget: Run total: 78,480,335 tokens (~$574.76) (15.7% of 500,000,000 ceiling). Model mix: Opus 33%, Sonnet 67%. — Claude (coordinator harness)

❌ iter 17 · Mean F1 0.1259 → 0.1216 (Δ-0.0043). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 3,691,588 in / 35,762 out — Claude (coordinator harness)

iter 18 · Evaluating against detectors:
Budget: Run total: 84,453,350 tokens (~$647.08) (16.9% of 500,000,000 ceiling). Model mix: Opus 36%, Sonnet 64%. — Claude (coordinator harness)

❌ iter 18 · Mean F1 0.1160 → 0.1160 (Δ+0.0000). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 6,101,455 in / 46,674 out — Claude (coordinator harness)

💸 iter 18 cost: $61.30 (6,148,129 tokens). Rolling mean (last 5): $23.77. Triggers:
Streak: 1 consecutive anomalous iter(s) (auto-pause at 3). — Claude (coordinator harness)

iter 19 · Evaluating against detectors:
Budget: Run total: 90,601,479 tokens (~$708.37) (18.1% of 500,000,000 ceiling). Model mix: Opus 37%, Sonnet 63%. — Claude (coordinator harness)

🚨 iter 19 · Gate failures: strict_regressions=['scanmw/063_twilio', 'scanmw/213_pagerduty'] recall_violations=['scanmw/546_cloudflare'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.0483 vs baseline 0.1035 (Δ-0.0553). Total FPs 406 → 556 (Δ+150). Working tree reverted; no commit. Budget: This iter: 2,219,377 in / 25,498 out — Claude (coordinator harness)

iter 20 · Evaluating against detectors:
Budget: Run total: 92,846,354 tokens (~$715.41) (18.6% of 500,000,000 ceiling). Model mix: Opus 36%, Sonnet 64%. — Claude (coordinator harness)

🚨 iter 20 · Gate failures: strict_regressions=['scanwelch/063_twilio', 'scanwelch/213_pagerduty', 'scanwelch/food_delivery_redis'] recall_violations=['scanwelch/093_cloudflare'] Top 5 |ΔF1| scenarios:
Observed mean F1 0.1205 vs baseline 0.1259 (Δ-0.0054). Total FPs 528 → 536 (Δ+8). Working tree reverted; no commit. Budget: This iter: 3,945,250 in / 66,416 out — Claude (coordinator harness)

iter 21 · Evaluating against detectors:
Budget: Run total: 98,929,769 tokens (~$761.26) (19.8% of 500,000,000 ceiling). Model mix: Opus 36%, Sonnet 64%. — Claude (coordinator harness)

❌ iter 21 · Mean F1 0.1160 → 0.1143 (Δ-0.0018). Reviewer verdicts:
Working tree reverted; no commit. Budget: This iter: 4,911,793 in / 57,933 out — Claude (coordinator harness)

🏁 Phase 1 plateaued after 5 consecutive non-improving iterations. Best score so far: 0.2210. Pivot #4. Banned (newly added): ['ar1-predictive-bocpd', 'dispersion-shift-conover', 'martingale-conformal-postfilter', 'multiscale-local-window', 'runlength-entropy-bocpd'] — Claude (coordinator harness)
Coordinator harness run-log.

Head: claude/observer-full-20260427T1644 → base: ella/claude-coordinator-harness (harness + observer; PR diff = ships only). Labels: full, q-branch-observer.

This PR is the bidirectional control channel: status comments posted by the coordinator; steering comments by the operator. Each shipped candidate becomes one commit on this branch, making the run a self-contained eval-matrix entry diffable against ella/claude-coordinator-harness.