
fix: fix success rate calculation #84

Merged
kimbochen merged 9 commits into main from fix-success-rate on Oct 6, 2025

Conversation

@cquil11 (Collaborator) commented Oct 5, 2025

The success rate calculation logic in utils/calc_success_rate.py was broken.

Instead of hard-coding the total number of possible runs for each GPU, the script now counts all attempted job runs and all successful job runs for each GPU.
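A minimal sketch of that counting approach is below. It assumes PyGithub for API access, that each job's name contains its GPU identifier, and that the standard GITHUB_* environment variables are set; the GPUS list and gpu_of helper are hypothetical, and the real utils/calc_success_rate.py may differ in its details.

# Hypothetical sketch, not the actual utils/calc_success_rate.py.
import json
import os

from github import Github  # PyGithub

# Checked longest/most-specific first so "b200" does not swallow "gb200" job names.
GPUS = ["mi300x", "mi325x", "mi355x", "gb200", "b200", "h100", "h200"]


def gpu_of(job_name):
    # Assumption: each job's name embeds its GPU identifier somewhere.
    name = job_name.lower()
    for gpu in GPUS:
        if gpu in name:
            return gpu
    return None


def main():
    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(os.environ["GITHUB_REPOSITORY"])
    run = repo.get_workflow_run(int(os.environ["GITHUB_RUN_ID"]))

    stats = {}
    for job in run.jobs():  # every attempted job in this workflow run
        gpu = gpu_of(job.name)
        if gpu is None:
            continue  # jobs not tied to a GPU (setup, scheduling, etc.)
        entry = stats.setdefault(gpu, {"n_success": 0, "total": 0})
        entry["total"] += 1
        if job.conclusion == "success":
            entry["n_success"] += 1

    with open("run_stats.json", "w") as f:
        json.dump(stats, f, indent=2)


if __name__ == "__main__":
    main()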

Full Sweep Test Example:
https://github.com/InferenceMAX/InferenceMAX/actions/runs/18265141263/job/51998546795#logs
Successfully calculates success rates of a reduced H100 sweep.

Additional Examples:

❯ export GITHUB_TOKEN=<redacted> GITHUB_RUN_ID=18180663482 GITHUB_REPOSITORY="InferenceMAX/InferenceMAX"
❯ python3 utils/calc_success_rate.py run_stats
Authenticated as user: cquil11
Found repo: InferenceMAX/InferenceMAX
Found run: 18180663482 - Full Sweep Scheduler - 8k1k

============================================================
GPU Success Rates
============================================================
GPU        Success    Total      Rate      
------------------------------------------------------------
b200       164        164        100.00    %
gb200      3          3          100.00    %
h100       30         30         100.00    %
h200       94         94         100.00    %
mi300x     45         45         100.00    %
mi325x     45         45         100.00    %
mi355x     74         75         98.67     %
============================================================

❯ cat run_stats.json
{
  "h100": {
    "n_success": 30,
    "total": 30
  },
  "h200": {
    "n_success": 94,
    "total": 94
  },
  "gb200": {
    "n_success": 3,
    "total": 3
  },
  "mi300x": {
    "n_success": 45,
    "total": 45
  },
  "mi325x": {
    "n_success": 45,
    "total": 45
  },
  "mi355x": {
    "n_success": 74,
    "total": 75
  },
  "b200": {
    "n_success": 164,
    "total": 164
  }
}
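
As a sanity check, the table above is just n_success / total per GPU (e.g. 74 / 75 = 98.67%); a few lines can rebuild it from run_stats.json. The formatting here is illustrative, not the script's actual output code.

# Sketch: recompute the success-rate table from run_stats.json.
import json

with open("run_stats.json") as f:
    stats = json.load(f)

print(f"{'GPU':<10} {'Success':<10} {'Total':<10} {'Rate':<10}")
for gpu in sorted(stats):
    n_success, total = stats[gpu]["n_success"], stats[gpu]["total"]
    rate = 100.0 * n_success / total if total else 0.0
    print(f"{gpu:<10} {n_success:<10} {total:<10} {rate:<10.2f}")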

@kimbochen (Collaborator) commented

Looks great, thank you so much!
Did you set the GitHub env vars, e.g. GITHUB_TOKEN etc?

@cquil11 (Collaborator, Author) commented Oct 5, 2025

> Looks great, thank you so much! Did you set the GitHub env vars, e.g. GITHUB_TOKEN etc?

I think they are set by the runner setup automatically. Checking now in the test run.

@kimbochen (Collaborator) left a comment

lgtm

@kimbochen kimbochen merged commit 42e8e45 into main Oct 6, 2025
@kimbochen kimbochen deleted the fix-success-rate branch October 6, 2025 00:49
Oseltamivir added a commit that referenced this pull request Apr 29, 2026
Stable ai-dynamo 1.0.2 (the only release on pypi.nvidia.com) imports
vllm.inputs.data, which vllm-project/vllm#35182 (2026-03-26, commit
ba2f0acc) deleted; the deepseekv4-cu130 image used here is post-deletion,
so dynamo.vllm workers crash on import in multimodal_handlers/__init__.
The fix shipped in 1.2.0.dev wheels but srt-slurm PR #84's DynamoConfig
schema only accepts version (PyPI) / hash / top_of_tree — no wheel: field.

Set dynamo.install: false in all 5 gb300 recipes so srtctl emits no
worker-side install line, and have launch_gb300-cw.sh:
  * stage aarch64 1.2.0.dev20260426 wheels under /mnt/vast/dynamo-wheels/
    (mkdir-as-lock; flock is unreliable on this VAST mount per prior runs)
  * symlink the cache into srt-slurm/configs/dynamo-wheels/ so the
    container sees /configs/dynamo-wheels/<version>/
  * append a `pip install --no-index --find-links` line to upstream's
    configs/vllm-container-deps.sh (which srtctl already runs in every
    worker container before launching dynamo.vllm)

This matches the working gb200-nv path (wheel: "1.2.0.dev20260426" on
the aflowers/vllm-gb200-v0.20.0 srt-slurm branch) without grafting that
branch's 19k-line schema diff onto PR #84.