[graph_trainer] Add nightly self-improvement scout and first report by SherlockNoMad · Pull Request #2806 · pytorch/torchtitan

SherlockNoMad · 2026-04-03T06:55:36Z

Stack from ghstack (oldest at bottom):

Add a nightly prompt (.claude/nightly.md) designed to be run by Claude Code
to discover self-improvement opportunities — not breakage detection (CI
handles that), but things like upstream API drift, test coverage gaps,
unblocked TODOs, and code freshness issues.

The scout covers 7 areas:

1. Core torchtitan delta review (opportunity/risk from upstream changes)
2. TODO unblock detection (11 tracked TODOs with upstream blockers)
3. Test coverage gap analysis (vs core's test matrix)
4. Performance opportunity discovery
5. Code freshness & technical debt
6. Documentation freshness
7. Open work tracking

Reports are written to graph_trainer/reports/YYYY-MM-DD.md.
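As a minimal sketch of that naming convention, the helper below builds a dated report path. `report_path` and its `root` argument are hypothetical names used only for illustration; the nightly prompt itself may construct the path differently.

```python
from datetime import date
from pathlib import Path


def report_path(day: date, root: str = "graph_trainer/reports") -> Path:
    """Return the report location for a given day as <root>/YYYY-MM-DD.md."""
    # date.isoformat() yields the zero-padded YYYY-MM-DD form.
    return Path(root) / f"{day.isoformat()}.md"
```

Calling `report_path(date(2026, 4, 2))` yields the path of the first report mentioned in this description.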

The first report (2026-04-02) surfaces two P0 findings:

- Llama3 parallelize.py missing enable_cp/enable_sp in apply_tp() call, meaning context parallelism silently malfunctions despite README claiming CP support
- fsdp_reshard_after_fwd_pass has zero test coverage (no unit or integration test)
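The first finding is the kind of gap a static check can flag. The sketch below is one possible approach, not the scout's actual method: it uses Python's `ast` module to list calls to a named function that omit required keyword arguments. The helper name `missing_keywords` and the sample call in the usage note are hypothetical; the real `apply_tp` signature is not shown in this PR.

```python
import ast


def missing_keywords(source: str, func_name: str, required: set[str]) -> list[set[str]]:
    """For each call to `func_name` in `source`, return the required keywords it omits."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Handle both `apply_tp(...)` and `module.apply_tp(...)`.
        callee = node.func
        name = callee.id if isinstance(callee, ast.Name) else getattr(callee, "attr", None)
        if name != func_name:
            continue
        # A `**kwargs` splat could supply anything, so such calls are not flagged.
        if any(kw.arg is None for kw in node.keywords):
            continue
        # Only keyword arguments are inspected; positional arguments are not resolved.
        passed = {kw.arg for kw in node.keywords}
        gap = required - passed
        if gap:
            missing.append(gap)
    return missing
```

For example, `missing_keywords("apply_tp(model, tp_mesh)", "apply_tp", {"enable_cp", "enable_sp"})` reports both keywords as missing for that one call. A check like this only shows that an argument is absent; whether the omission is a bug still needs a human or a test to confirm.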

It took 6 min to generate the first report. Not bad.


SherlockNoMad · 2026-04-03T06:57:52Z

torchtitan/experiments/graph_trainer/.claude/nightly.md

+
+Output: specific inaccuracies found, or "docs are current."
+
+## 7. Open Work Tracking


This isn't working because cc (Claude Code) is blocked from using the gh CLI.

Needs investigation.


SherlockNoMad · 2026-04-03T07:03:46Z

torchtitan/experiments/graph_trainer/.claude/nightly.md

+- Check recent PyTorch commits in `torch/_inductor/`, `torch/_dynamo/`,
+  `torch/_functorch/`, `torch/distributed/_tensor/` for new optimization
+  features that graph_trainer could leverage.
+- Check if any new `torch.compile` modes, backend options, or config knobs
+  have been added that graph_trainer's `compile.py` or `passes.py` should
+  know about.


Not working: the agent doesn't know how to access these.
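If a local PyTorch checkout were available to the agent (an assumption; this PR does not say one is), the commit check in the prompt could be done without network access. The sketch below builds a `git log` invocation limited to the four directories named above. The function names and the `repo` argument are hypothetical.

```python
import subprocess

# Directories taken from the prompt text under review.
WATCHED_DIRS = [
    "torch/_inductor/",
    "torch/_dynamo/",
    "torch/_functorch/",
    "torch/distributed/_tensor/",
]


def recent_commits_cmd(repo: str, since: str = "1 day ago") -> list[str]:
    """Build a `git log` command restricted to the watched directories."""
    return ["git", "-C", repo, "log", f"--since={since}", "--oneline", "--", *WATCHED_DIRS]


def recent_commits(repo: str, since: str = "1 day ago") -> list[str]:
    """Run the command and return one summary line per commit."""
    result = subprocess.run(
        recent_commits_cmd(repo, since), check=True, capture_output=True, text=True
    )
    return result.stdout.splitlines()
```

Keeping command construction separate from execution lets the path list be checked without a repository on disk.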


Add a nightly prompt (.claude/nightly.md) designed to be run by Claude Code to discover self-improvement opportunities: not breakage detection (CI handles that), but things like upstream API drift, test coverage gaps, unblocked TODOs, and code freshness issues.

The scout covers 5 areas (all local, no network access required):

1. Core torchtitan delta review (opportunity/risk from upstream changes)
2. TODO unblock detection (dynamic discovery, local torch inspection)
3. Test & CI coverage gap analysis (file comparison vs workflow YAMLs)
4. Code freshness & technical debt (monkey-patches, private APIs, config drift)
5. Documentation freshness

Removed from earlier version: performance opportunity discovery (produced no actionable output), open work tracking (requires GitHub API), CI status checks (requires GitHub API), git push (requires network access).

Reports are written to graph_trainer/reports/YYYY-MM-DD.md. After the report, action items are implemented as one-commit-per-item on a graph_trainer/self_improve/YYYY-MM-DD branch.

The first report (2026-04-02) surfaces two P0 findings:

- Llama3 parallelize.py missing enable_cp/enable_sp in apply_tp() call, meaning context parallelism silently malfunctions despite README claiming CP support
- fsdp_reshard_after_fwd_pass has zero test coverage (no unit or integration test)

ghstack-source-id: 7c03f4a
Pull Request resolved: #2806
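The "dynamic discovery" of TODOs in the revised scope could look roughly like the sketch below, which scans Python files under a directory for `# TODO` comments instead of relying on a hard-coded list. This is an illustration under assumptions: `find_todos` and the regular expression are hypothetical, and the actual prompt may use different tooling.

```python
import re
from pathlib import Path

# Matches "# TODO", "# TODO:", "#TODO(name): ..." and captures the trailing text.
TODO_RE = re.compile(r"#\s*TODO\b[:\s]*(.*)")


def find_todos(root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, text) for every TODO comment under `root`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = TODO_RE.search(line)
            if match:
                hits.append((path.as_posix(), lineno, match.group(1).strip()))
    return hits
```

Each hit could then be compared against the locally installed torch to judge whether its upstream blocker has landed; that comparison is specific to each TODO and is not sketched here.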


pytorch-bot bot added the ciflow/8gpu label Apr 3, 2026

SherlockNoMad mentioned this pull request Apr 3, 2026

[graph_trainer] Add bitwise deterministic guardrail test #2799

Merged

meta-cla bot added the CLA Signed label Apr 3, 2026

This was referenced Apr 3, 2026

[graph_trainer] Add benchmark.py for forward-backward profiling #2805

Closed

[graph_trainer] Add prerequisite step to nightly scout: checkout main #2807

Closed

SherlockNoMad requested review from aditvenk, tianyu-l and xmfan April 3, 2026 06:56


SherlockNoMad requested a review from yiming0416 April 3, 2026 07:00


SherlockNoMad mentioned this pull request Apr 3, 2026

[graph_trainer] Add nightly self-improvement scout and first report #2838

Merged

SherlockNoMad closed this Apr 4, 2026

SherlockNoMad wants to merge 2 commits into gh/SherlockNoMad/11/base from gh/SherlockNoMad/11/head
