
Integrate with pt_job_queue #2947

Closed
xmfan wants to merge 1 commit into pytorch:main from xmfan:main

@xmfan xmfan commented Apr 13, 2026

Changes needed for: drisspg/pt_job_queue#9

There are two usage modes:

  • control and run locally, on devgpu (directly call ptq)
  • control locally and run remotely on ODC (prefix ptq commands with uv run)
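
The only difference between the two modes, as described above, is whether `ptq` is called directly or prefixed with `uv run`. A minimal sketch of that rule follows. The helper name `ptq_cmd` and the mode labels `local` and `remote` are made up for illustration and are not part of ptq; the function only prints the command line and never runs it.

```shell
# Hypothetical helper (not part of ptq): print the ptq command line for a mode.
# Dry run only: nothing is executed.
ptq_cmd() {
    mode="$1"
    shift
    case "$mode" in
        # Local mode: call ptq directly.
        local)  printf '%s\n' "ptq $*" ;;
        # Remote mode: prefix the same command with "uv run".
        remote) printf '%s\n' "uv run ptq $*" ;;
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

ptq_cmd local list
ptq_cmd remote list
```

The two calls at the end print `ptq list` and `uv run ptq list` respectively.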

Quick Start: Using ptq with TorchTitan (Remote on ODC)

Prerequisites

  • pt_job_queue cloned locally

1. Set up workspace

uv run ptq setup <machine>

2. Register TorchTitan

uv run ptq repo add git@github.com:xmfan/torchtitan.git --machine <machine>
uv run ptq repo list

3. Run a job

# Investigate a GitHub issue
uv run ptq run --issue 2818 --repo torchtitan --machine <machine>

# Run an adhoc task
uv run ptq run --repo torchtitan --machine <machine> -m "fix the FSDP OOM bug"

4. Check results

uv run ptq list                  # see all jobs
uv run ptq peek <job_id>         # check progress
uv run ptq results <job_id>      # fetch results
uv run ptq web                   # web dashboard at http://127.0.0.1:8000

Testing plan in upstream PR: drisspg/pt_job_queue#9.

@meta-cla meta-cla bot added the CLA Signed label Apr 13, 2026
@xmfan xmfan marked this pull request as ready for review April 13, 2026 16:50

### 2. Investigate
- Read relevant TorchTitan source code in `{workspace}/jobs/{job_id}/torchtitan/`.
- Key source locations: `torchtitan/models/`, `torchtitan/parallelisms/`, `torchtitan/train.py`, `torchtitan/config_manager.py`

`torchtitan/parallelisms/` and `torchtitan/config_manager.py` do not exist. Do you mean `torchtitan/distributed/` and `torchtitan/config/job_config.py`?
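
Stale paths like these could be caught mechanically before a prompt lands. A minimal sketch, assuming a local TorchTitan checkout: the helper name `check_paths` is hypothetical and not part of ptq or TorchTitan. It reports which of the given relative paths exist under a root directory and returns non-zero if any are missing.

```shell
# Hypothetical helper: report which relative paths exist under a checkout root.
# Usage: check_paths <root> <path>...
check_paths() {
    root="$1"
    shift
    missing=0
    for p in "$@"; do
        if [ -e "$root/$p" ]; then
            echo "ok $p"
        else
            echo "missing $p"
            missing=1
        fi
    done
    # Non-zero exit status when at least one path is absent.
    return "$missing"
}
```

For example, `check_paths <checkout> torchtitan/models torchtitan/distributed torchtitan/train.py torchtitan/config/job_config.py`, where `<checkout>` is a placeholder for wherever the repository is cloned.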

@tianyu-l tianyu-l left a comment

discussed offline:

  • since ptq is still new, as torchtitan maintainer I prefer this not to be exposed to OSS publicly yet
  • on ptq side @drisspg seems happy with torchtitan specifics added to ptq repo, so technically this is unblocked
  • more than happy to revisit later

xmfan commented Apr 13, 2026

Landed as a hardcoded repo in drisspg/pt_job_queue#9

@xmfan xmfan closed this Apr 13, 2026

4 participants