gameplay_capture: carry known_good action as supervised label #23

Merged: dp-web4 merged 1 commit into main from router/gameplay-known-good-labels on Apr 18, 2026

dp-web4 (Owner) commented Apr 18, 2026

Summary

Per Dennis's observation: the gameplay records are supervised training triples, but we were only capturing what our baseline proposed, not what the right move actually was. The winning trace's per-step action is by definition a good next action, so we encode it in metadata for downstream training to use as the teacher signal.

What this ships

Three new fields on every gameplay record's metadata:

  • known_good_action: int — GameAction value the winning trace took
  • known_good_data: Dict|None — click coords for CLICK actions, None otherwise
  • known_good_level: int — game level at this step
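
As a rough illustration of the three fields above, here is a minimal sketch of how one step of a winning trace could be turned into metadata entries. The `GameAction` enum below is a hypothetical subset (only the values 1, 3 and 6 are named in this PR), and `known_good_metadata` and the `existing_key` entry are illustrative names, not the capture module's actual API.

```python
from enum import IntEnum
from typing import Any, Dict, Optional


class GameAction(IntEnum):
    # Hypothetical subset: only these three values are named in the PR text.
    UP = 1
    LEFT = 3
    CLICK = 6


def known_good_metadata(action: GameAction,
                        data: Optional[Dict[str, int]],
                        level: int) -> Dict[str, Any]:
    """Build the three label fields for one step of a winning trace."""
    is_click = action == GameAction.CLICK
    return {
        "known_good_action": int(action),
        # Click coordinates are kept only for CLICK steps; None otherwise.
        "known_good_data": dict(data) if is_click and data else None,
        "known_good_level": level,
    }


# The record's metadata is an open dict, so the labels are simply merged in.
metadata: Dict[str, Any] = {"existing_key": "unchanged"}  # placeholder content
metadata.update(known_good_metadata(GameAction.CLICK, {"x": 12, "y": 30}, level=2))
```

Because the labels are plain dict entries, any keys already present in the metadata stay as they were.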

What this unlocks

| Training task | Target | State |
| --- | --- | --- |
| Router BC | baseline_dispatch | already worked |
| Action prediction | known_good_action | NEW: direct action-level supervision |
| Motor-skill BC by demonstration | known_good_action given (state, skill_params) | NEW: when motor-skills land |
| Outcome-weighted shaping | sample_weight ∝ game_outcome.won | NEW |
| Backprop through chained components | terminal-loss on winning action | NEW: whole-stack gradient |
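
To make the action-prediction and outcome-weighted rows concrete, the sketch below turns records into (state, target, weight) examples. The record layout is an assumption: it treats each record as a dict with a `state` key and a `metadata` dict, and it assumes `game_outcome` sits inside the metadata with a boolean `won`. Only the `known_good_action` name comes from this PR.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Any, int, float]  # (state, known_good_action, sample_weight)


def action_prediction_examples(records: List[Dict[str, Any]]) -> List[Example]:
    """Build (state -> known_good_action) examples with outcome-based weights."""
    examples: List[Example] = []
    for rec in records:
        meta = rec.get("metadata", {})
        if "known_good_action" not in meta:
            continue  # record captured before the labels existed
        # Assumed location of the outcome; the weight is proportional to `won`.
        won = bool(meta.get("game_outcome", {}).get("won", False))
        examples.append((rec.get("state"), meta["known_good_action"], float(won)))
    return examples
```

The resulting weights could be passed as per-sample weights to whatever loss the trainer uses; that wiring is left out here because the PR does not describe it.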

"This is what SAGE should do next to evaluate what it proposes next" — the proposal and the ground truth are now both in every record.

Backward compatibility

RouterRecord schema unchanged (metadata is an open dict). Old consumers that don't look at the new fields are unaffected. The PR #21 records emitted before this merge don't have the new labels, but that's 148 records on CBP — easy to re-capture via fleet_gameplay_capture.sh.
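
A consumer that wants the new labels can detect older records by key presence. The helper below is a hypothetical sketch, not part of the repository; it assumes all three keys are always written together, with `known_good_data` present but set to None on non-click steps.

```python
from typing import Any, Dict, Iterable, List, Tuple

LABEL_KEYS = ("known_good_action", "known_good_data", "known_good_level")


def split_by_labels(
    metadatas: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate metadata dicts that carry all three labels from those that do not."""
    labeled: List[Dict[str, Any]] = []
    unlabeled: List[Dict[str, Any]] = []
    for meta in metadatas:
        # Test for key presence, not truthiness: known_good_data may be None.
        has_all = all(key in meta for key in LABEL_KEYS)
        (labeled if has_all else unlabeled).append(meta)
    return labeled, unlabeled
```

The size of the unlabeled group gives a quick count of records that would benefit from re-capture.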

Tests

2 new unit tests (18 total in the module). They verify:

  • known_good_action exactly matches the trace action (1/3/6 for UP/LEFT/CLICK)
  • Click-step known_good_data carries {x, y} coords
  • known_good_level passes through correctly
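
The three checks above could look roughly like the following pytest-style sketch. It is not the PR's actual test code: `TRACE`, its key names, and `labels_for_step` are stand-ins invented for illustration.

```python
from typing import Any, Dict, List

# Hypothetical winning trace, one dict per step; the key names are assumptions.
TRACE: List[Dict[str, Any]] = [
    {"action": 1, "data": None, "level": 0},                # UP
    {"action": 3, "data": None, "level": 0},                # LEFT
    {"action": 6, "data": {"x": 10, "y": 20}, "level": 1},  # CLICK
]


def labels_for_step(step: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the capture code: copy one trace step into label fields."""
    return {
        "known_good_action": step["action"],
        "known_good_data": step["data"],
        "known_good_level": step["level"],
    }


def test_action_matches_trace() -> None:
    got = [labels_for_step(s)["known_good_action"] for s in TRACE]
    assert got == [1, 3, 6]


def test_click_data_and_level_pass_through() -> None:
    labels = [labels_for_step(s) for s in TRACE]
    assert labels[2]["known_good_data"] == {"x": 10, "y": 20}
    assert labels[0]["known_good_data"] is None
    assert [lab["known_good_level"] for lab in labels] == [0, 0, 1]
```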

Recommendation post-merge

Re-run fleet_gameplay_capture.sh on machines that already ran it (currently CBP only) so their records get upgraded with the new labels. Machines that haven't run it yet just get the labels natively on first run.

🤖 Generated with Claude Code

Per Dennis's observation: the gameplay records ARE supervised training
triples, but we were only capturing what our baseline PROPOSED, not
what actually was the right move. The winning trace's per-step action
is by definition a good next action — encode it in metadata so
downstream training can use it as the teacher signal.

New metadata fields on every gameplay record:
- known_good_action: int  (GameAction value that the winning trace took)
- known_good_data: Dict|None  (click coords for action=6, else None)
- known_good_level: int  (game level at this step)

What this unlocks for training:
- Router BC: (state → baseline_dispatch)      [already worked]
- Action prediction: (state → known_good_action)      [NEW]
- Motor-skill BC by demonstration: (state × skill_params → known_good_action)  [NEW]
- Outcome-weighted shaping: sample_weight ∝ game_outcome.won   [NEW]
- Backprop through chained components using the winning action
  as the terminal-loss target                                   [NEW]

'This is what SAGE should do next to evaluate what it proposes next' —
the proposal and the ground truth are now both in every record.

Tests: 2 new (18 total). Verify known_good_action matches trace action
exactly, click-step known_good_data carries coords, levels pass through.

Backward compat: old consumers that don't look at the new fields are
unaffected. RouterRecord schema unchanged (metadata is open dict).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dp-web4 merged commit 3537a9e into main Apr 18, 2026
dp-web4 deleted the router/gameplay-known-good-labels branch April 18, 2026 20:28
dp-web4 pushed a commit that referenced this pull request Apr 28, 2026
…subtraction. Phi4 register-substitution discovered (Δpol -3.36, Δbiz +1.08 same trajectory). Hardware register quantified — Thor Δhw +2.46 largest single Δ, positive across all 8 raised instances. CBP basin = TED+gov+marketing combo. Lexicon substring FP bug fixed (recurrence #9 of S110 pattern at analysis layer). S119 #18/#19/#20 executed; #21/#22/#23/#24 held.

Machine: localhost.localdomain
Date: 2026-04-28 01:13:05 UTC

Changes committed automatically at session end.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>