Add in-tree multi-repo support and torchtitan profile #9

Merged
xmfan merged 8 commits into drisspg:main from xmfan:xmfan/torchtitan
Apr 13, 2026
Collaborator

@xmfan xmfan commented Apr 6, 2026

Summary

  • Refactor hardcoded pytorch profile into a config-driven repo registry ([repos.*] in config.toml)
  • Add torchtitan as an add-on repo with dedicated prompts, lightweight venv setup (inherits base torch build), and per-job worktrees
  • Wire --repo through CLI, web UI, services, job IDs, and issue lookup
  • Both repos are always cloned in every workspace; --repo selects which repo an issue is filed in

Test plan

  • python -m pytest tests/ passes
  • ptq --help / ptq run --help / ptq worktree --help show updated descriptions
  • Web UI dropdown says "Issue from"
  • ptq setup gpu-dev — workspace clones both repos
  • ptq run --issue 2818 --repo torchtitan --machine gpu-dev — torchtitan job runs, agent can cross-reference pytorch source
  • ptq run --issue 179597 --machine gpu-dev — pytorch job runs
  • Web UI: job list shows repo column, new job form works for both repos
  • uv run ptq run --repo torchtitan --machine gpu-dev-745c50c9 -m "fix the error as seen in the description of [Bug][EP][compile] non_blocking=True D2H race in input_splits under per-layer torch.compile pytorch/torchtitan#2951"

@xmfan xmfan force-pushed the xmfan/torchtitan branch from 419db27 to 5c79fc6 Compare April 8, 2026 08:20
## Debugging Tools

**Distributed training debugging**:
- Run with single process first: `CUDA_VISIBLE_DEVICES=0 {workspace}/jobs/{job_id}/.venv/bin/python <script.py>`
Collaborator Author

this only works if the script uses a fake process group. let's remove this instruction and always use torchrun

job_dir = f"{backend.workspace}/jobs/{job_id}"
worktree = f"{job_dir}/pytorch"

from ptq.repo_profiles import get_profile
Collaborator Author

imports at top of file

job_dir = f"{workspace}/jobs/{job_id}"
worktree = f"{job_dir}/pytorch"

from ptq.repo_profiles import get_profile
Collaborator Author

imports at top of file

ptq/cli.py Outdated
] = "pytorch",
) -> None:
- """Launch an AI agent to investigate a PyTorch issue or run an adhoc task.
+ """Launch an AI agent to investigate a PyTorch/TorchTitan issue or run an adhoc task.
Collaborator Author

let's just remove repo names from prompts

@xmfan xmfan force-pushed the xmfan/torchtitan branch 3 times, most recently from 50a894b to 643ebe2 Compare April 9, 2026 09:39
@xmfan xmfan changed the title Add torchtitan support Add multi-repo support and torchtitan profile Apr 9, 2026
xmfan added 5 commits April 9, 2026 11:59
Move hardcoded pytorch profile into a config-driven RepoProfile
registry loaded from [repos.*] sections in config.toml. Prompt
templates are discovered by naming convention. Built-in defaults
used as fallback when config has no [repos] section.
- Add torchtitan profile to config.toml and _DEFAULT_PROFILES
- Add investigate/adhoc prompt templates for torchtitan
- Add repo field to JobRecord and RunRequest
- Include repo name in job IDs to avoid cross-repo collisions
- Filter find_by_issue by repo for correct re-run matching
- Update agent.py and issue.py to use repo profiles
- run_service / worktree_service: repo-aware worktree and venv setup;
  move _setup_lightweight_venv to worktree_service
- job_service / pr_service / rebase_service: top-level profile imports
- cli.py: generic --repo flag, auto-reload via create_debug_app factory
- workspace.py: generic _clone_repo driven by repo profiles
- app.py: add create_debug_app() factory for uvicorn auto-reload
- routes.py: pass profile objects to template for dynamic repo dropdown,
  repo column in job list, merge-base diff, dynamic issue links
- templates: iterate repos from config, repo column, dynamic issue links
@xmfan xmfan force-pushed the xmfan/torchtitan branch from 643ebe2 to edf160f Compare April 9, 2026 10:00
from ptq.application.worktree_service import provision_worktree, validate_workspace
from ptq.domain.policies import make_job_id
from ptq.infrastructure.backends import create_backend
from ptq.repo_profiles import get_profile
Collaborator Author

apparently there's some circular imports

Owner

can we un-circularize?

@xmfan xmfan marked this pull request as ready for review April 9, 2026 10:01
@xmfan xmfan force-pushed the xmfan/torchtitan branch from 8cf51f7 to c2311b1 Compare April 9, 2026 13:04
@xmfan xmfan changed the title Add multi-repo support and torchtitan profile Add multi-repo support + torchtitan Apr 9, 2026
@xmfan xmfan force-pushed the xmfan/torchtitan branch from c6922c0 to ea2fb37 Compare April 13, 2026 15:52
@xmfan xmfan changed the title Add multi-repo support + torchtitan Add multi-repo support with .ptq/ folder discovery Apr 13, 2026
@xmfan xmfan force-pushed the xmfan/torchtitan branch 2 times, most recently from 80dbd15 to 09538ea Compare April 13, 2026 16:21
@xmfan xmfan force-pushed the xmfan/torchtitan branch from 09538ea to c2311b1 Compare April 13, 2026 21:51
xmfan added 2 commits April 13, 2026 14:54
config._parse() was importing repo_profiles.load_profiles_from_config,
while repo_profiles._loaded_profiles() imported config.load_config.

Fix: config.py now stores the raw [repos.*] TOML dict as repos_raw.
repo_profiles parses it when loading profiles. One-way dependency.
ptq setup now only clones pytorch by default. Add-on repos like
torchtitan are cloned only when explicitly requested:

    ptq setup <machine> --extras torchtitan
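
The opt-in behaviour could be sketched as below. `BASE_REPO`, `KNOWN_ADDONS`, and `repos_to_clone` are hypothetical names; only the `--extras` flag and the pytorch-by-default rule come from the commit message.

```python
BASE_REPO = "pytorch"
KNOWN_ADDONS = frozenset({"torchtitan"})  # assumed set of registered add-on repos


def repos_to_clone(extras: list[str]) -> list[str]:
    """Return the base repo plus any explicitly requested add-ons, in order."""
    unknown = sorted(set(extras) - KNOWN_ADDONS)
    if unknown:
        raise ValueError(f"unknown add-on repo(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while preserving the order given by the caller.
    return [BASE_REPO, *dict.fromkeys(extras)]
```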
@xmfan xmfan changed the title Add multi-repo support with .ptq/ folder discovery Add multi-repo support and torchtitan profile Apr 13, 2026
@xmfan xmfan changed the title Add multi-repo support and torchtitan profile Add in-tree multi-repo support and torchtitan profile Apr 13, 2026
@xmfan xmfan merged commit be28d87 into drisspg:main Apr 13, 2026
2 participants