
fix: fix success rate calculation #84

Merged
kimbochen merged 9 commits into main from fix-success-rate on Oct 6, 2025

Conversation

@cquil11 (Collaborator) commented Oct 5, 2025

The success rate calculation logic in utils/calc_success_rate.py was broken.

Instead of hard-coding the total number of possible runs for each GPU, the script now counts all attempted job runs and all successful job runs for each GPU.
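A minimal sketch of that counting approach is below. It assumes PyGithub for API access, that each job's name contains its GPU identifier, and that the standard GITHUB_* environment variables are set; the GPUS list and gpu_of helper are hypothetical, and the real utils/calc_success_rate.py may differ in its details.

# Hypothetical sketch, not the actual utils/calc_success_rate.py.
import json
import os

from github import Github  # PyGithub

# Checked longest/most-specific first so "b200" does not swallow "gb200" job names.
GPUS = ["mi300x", "mi325x", "mi355x", "gb200", "b200", "h100", "h200"]


def gpu_of(job_name):
    # Assumption: each job's name embeds its GPU identifier somewhere.
    name = job_name.lower()
    for gpu in GPUS:
        if gpu in name:
            return gpu
    return None


def main():
    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(os.environ["GITHUB_REPOSITORY"])
    run = repo.get_workflow_run(int(os.environ["GITHUB_RUN_ID"]))

    stats = {}
    for job in run.jobs():  # every attempted job in this workflow run
        gpu = gpu_of(job.name)
        if gpu is None:
            continue  # jobs not tied to a GPU (setup, scheduling, etc.)
        entry = stats.setdefault(gpu, {"n_success": 0, "total": 0})
        entry["total"] += 1
        if job.conclusion == "success":
            entry["n_success"] += 1

    with open("run_stats.json", "w") as f:
        json.dump(stats, f, indent=2)


if __name__ == "__main__":
    main()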

Full Sweep Test Example:
https://github.com/InferenceMAX/InferenceMAX/actions/runs/18265141263/job/51998546795#logs
Successfully calculates success rates of a reduced H100 sweep.

Additional Examples:

❯ export GITHUB_TOKEN=<redacted> GITHUB_RUN_ID=18180663482 GITHUB_REPOSITORY="InferenceMAX/InferenceMAX"
❯ python3 utils/calc_success_rate.py run_stats
Authenticated as user: cquil11
Found repo: InferenceMAX/InferenceMAX
Found run: 18180663482 - Full Sweep Scheduler - 8k1k

============================================================
GPU Success Rates
============================================================
GPU        Success    Total      Rate      
------------------------------------------------------------
b200       164        164        100.00    %
gb200      3          3          100.00    %
h100       30         30         100.00    %
h200       94         94         100.00    %
mi300x     45         45         100.00    %
mi325x     45         45         100.00    %
mi355x     74         75         98.67     %
============================================================

❯ cat run_stats.json
{
  "h100": {
    "n_success": 30,
    "total": 30
  },
  "h200": {
    "n_success": 94,
    "total": 94
  },
  "gb200": {
    "n_success": 3,
    "total": 3
  },
  "mi300x": {
    "n_success": 45,
    "total": 45
  },
  "mi325x": {
    "n_success": 45,
    "total": 45
  },
  "mi355x": {
    "n_success": 74,
    "total": 75
  },
  "b200": {
    "n_success": 164,
    "total": 164
  }
}
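
As a sanity check, the table above is just n_success / total per GPU (e.g. 74 / 75 = 98.67%); a few lines can rebuild it from run_stats.json. The formatting here is illustrative, not the script's actual output code.

# Sketch: recompute the success-rate table from run_stats.json.
import json

with open("run_stats.json") as f:
    stats = json.load(f)

print(f"{'GPU':<10} {'Success':<10} {'Total':<10} {'Rate':<10}")
for gpu in sorted(stats):
    n_success, total = stats[gpu]["n_success"], stats[gpu]["total"]
    rate = 100.0 * n_success / total if total else 0.0
    print(f"{gpu:<10} {n_success:<10} {total:<10} {rate:<10.2f}")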

@kimbochen (Collaborator) commented

Looks great, thank you so much!
Did you set the GitHub env vars, e.g. GITHUB_TOKEN etc?

@cquil11 (Collaborator, Author) commented Oct 5, 2025

> Looks great, thank you so much! Did you set the GitHub env vars, e.g. GITHUB_TOKEN etc?

I think they are set by the runner setup automatically. Checking now in the test run.

@kimbochen (Collaborator) left a comment

lgtm

@kimbochen kimbochen merged commit 42e8e45 into main Oct 6, 2025
@kimbochen kimbochen deleted the fix-success-rate branch October 6, 2025 00:49
Oseltamivir added a commit that referenced this pull request Apr 29, 2026
Stable ai-dynamo 1.0.2 (the only release on pypi.nvidia.com) imports
vllm.inputs.data, which vllm-project/vllm#35182 (2026-03-26, commit
ba2f0acc) deleted; the deepseekv4-cu130 image used here is post-deletion,
so dynamo.vllm workers crash on import in multimodal_handlers/__init__.
The fix shipped in 1.2.0.dev wheels but srt-slurm PR #84's DynamoConfig
schema only accepts version (PyPI) / hash / top_of_tree — no wheel: field.

Set dynamo.install: false in all 5 gb300 recipes so srtctl emits no
worker-side install line, and have launch_gb300-cw.sh:
  * stage aarch64 1.2.0.dev20260426 wheels under /mnt/vast/dynamo-wheels/
    (mkdir-as-lock; flock is unreliable on this VAST mount per prior runs)
  * symlink the cache into srt-slurm/configs/dynamo-wheels/ so the
    container sees /configs/dynamo-wheels/<version>/
  * append a `pip install --no-index --find-links` line to upstream's
    configs/vllm-container-deps.sh (which srtctl already runs in every
    worker container before launching dynamo.vllm)

This matches the working gb200-nv path (wheel: "1.2.0.dev20260426" on
the aflowers/vllm-gb200-v0.20.0 srt-slurm branch) without grafting that
branch's 19k-line schema diff onto PR #84.