gameplay_capture: carry known_good action as supervised label #23
Merged
Conversation
Per Dennis's observation: the gameplay records ARE supervised training triples, but we were only capturing what our baseline PROPOSED, not what actually was the right move. The winning trace's per-step action is by definition a good next action — encode it in metadata so downstream training can use it as the teacher signal.

New metadata fields on every gameplay record:
- known_good_action: int (GameAction value that the winning trace took)
- known_good_data: Dict|None (click coords for action=6, else None)
- known_good_level: int (game level at this step)

What this unlocks for training:
- Router BC: (state → baseline_dispatch) [already worked]
- Action prediction: (state → known_good_action) [NEW]
- Motor-skill BC by demonstration: (state × skill_params → known_good_action) [NEW]
- Outcome-weighted shaping: sample_weight ∝ game_outcome.won [NEW]
- Backprop through chained components using the winning action as the terminal-loss target [NEW]

'This is what SAGE should do next to evaluate what it proposes next' — the proposal and the ground truth are now both in every record.

Tests: 2 new (18 total). Verify known_good_action matches trace action exactly, click-step known_good_data carries coords, levels pass through.

Backward compat: old consumers that don't look at the new fields are unaffected. RouterRecord schema unchanged (metadata is an open dict).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
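As a concrete sketch, the labeling step described above might look like the following. The helper `label_record` and its trace arguments are hypothetical stand-ins; the three field names and the action=6 click convention come from the commit message.

```python
# Sketch of attaching the winning trace's step as the supervised label.
# label_record and its arguments are illustrative; the field names and the
# action=6 click convention are from the commit message above.
from typing import Any, Dict, Optional

CLICK_ACTION = 6  # GameAction value whose step data carries click coordinates

def label_record(metadata: Dict[str, Any],
                 trace_action: int,
                 trace_data: Optional[Dict[str, int]],
                 level: int) -> Dict[str, Any]:
    """Add known_good_* fields to a gameplay record's metadata dict."""
    metadata["known_good_action"] = trace_action
    # Only click steps carry coordinate data; everything else gets None.
    metadata["known_good_data"] = trace_data if trace_action == CLICK_ACTION else None
    metadata["known_good_level"] = level
    return metadata

# A click step (action=6) keeps its {x, y} coords; a non-click action gets None.
click_meta = label_record({}, trace_action=6, trace_data={"x": 3, "y": 7}, level=2)
move_meta = label_record({}, trace_action=1, trace_data={"x": 3, "y": 7}, level=2)
```

Because the fields are plain metadata keys, a writer like this leaves the surrounding record schema untouched, which is what keeps old consumers working.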
dp-web4 pushed a commit that referenced this pull request on Apr 28, 2026:
…subtraction. Phi4 register-substitution discovered (Δpol -3.36, Δbiz +1.08 same trajectory). Hardware register quantified — Thor Δhw +2.46 largest single Δ, positive across all 8 raised instances. CBP basin = TED+gov+marketing combo. Lexicon substring FP bug fixed (recurrence #9 of S110 pattern at analysis layer). S119 #18/#19/#20 executed; #21/#22/#23/#24 held.

Machine: localhost.localdomain
Date: 2026-04-28 01:13:05 UTC
Changes committed automatically at session end.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Per Dennis's observation: the gameplay records are supervised training triples, but we were only capturing what our baseline PROPOSED, not what actually was the right move. The winning trace's per-step action is by definition a good next action — encode it in metadata so downstream training can use it as the teacher signal.
What this ships
Three new fields on every gameplay record's metadata:
- known_good_action: int — GameAction value the winning trace took
- known_good_data: Dict|None — click coords for CLICK actions, None otherwise
- known_good_level: int — game level at this step

What this unlocks
- Router BC: (state → baseline_dispatch) [already worked]
- Action prediction: (state → known_good_action) [new]
- Motor-skill BC by demonstration: known_good_action given (state, skill_params) [new]
- Outcome-weighted shaping: sample_weight ∝ game_outcome.won [new]
- Backprop through chained components using the winning action as the terminal-loss target [new]
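Two of those signals can be sketched together: action-prediction pairs (state → known_good_action) with sample_weight ∝ game_outcome.won. The record shape used here is an assumption for illustration, not the actual RouterRecord schema.

```python
# Minimal sketch of extracting outcome-weighted supervised pairs.
# The dict layout ("state", "metadata", "game_outcome") is assumed; only the
# known_good_action field name is taken from this PR.
from typing import Any, Dict, List, Tuple

def supervised_pairs(records: List[Dict[str, Any]]) -> List[Tuple[Any, int, float]]:
    """Extract (state, known_good_action, sample_weight) training triples."""
    pairs: List[Tuple[Any, int, float]] = []
    for rec in records:
        meta = rec["metadata"]
        if "known_good_action" not in meta:
            continue  # pre-merge records carry no label; skip them
        weight = 1.0 if rec["game_outcome"]["won"] else 0.0
        pairs.append((rec["state"], meta["known_good_action"], weight))
    return pairs

records = [
    {"state": "s0", "metadata": {"known_good_action": 3}, "game_outcome": {"won": True}},
    {"state": "s1", "metadata": {}, "game_outcome": {"won": False}},
]
pairs = supervised_pairs(records)
```

Skipping unlabeled records, rather than erroring on them, is what lets pre-merge captures coexist with upgraded ones in the same dataset.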
Backward compatibility

RouterRecord schema unchanged (metadata is an open dict). Old consumers that don't look at the new fields are unaffected. The PR #21 records emitted before this merge don't have the new labels, but that's 148 records on CBP — easy to re-capture via fleet_gameplay_capture.sh.

Tests
2 new unit tests (18 total in the module). Verify:

- known_good_action exactly matches the trace action (1/3/6 for UP/LEFT/CLICK)
- known_good_data carries {x, y} coords
- known_good_level passes through correctly
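The checks above can be mirrored as plain asserts against a labeled click-step record. The record literal below is an illustrative stand-in, not taken from the actual test module; 1/3/6 are the UP/LEFT/CLICK GameAction values named in the PR text.

```python
# Illustrative check mirroring the new tests on a hypothetical click step.
record = {
    "metadata": {
        "known_good_action": 6,             # CLICK step from the winning trace
        "known_good_data": {"x": 4, "y": 9},
        "known_good_level": 3,
    }
}
meta = record["metadata"]
assert meta["known_good_action"] in (1, 3, 6)       # matches a trace action
assert set(meta["known_good_data"]) == {"x", "y"}   # click step carries coords
assert meta["known_good_level"] == 3                # level passes through
```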
Recommendation post-merge

Re-run fleet_gameplay_capture.sh on machines that already ran it (currently CBP only) so their records get upgraded with the new labels. Machines that haven't run it yet just get the labels natively on first run.

🤖 Generated with Claude Code