Trigger H200 multinode evals & revert MI355X image to mori-0227-3 #1094

Merged
Oseltamivir merged 9 commits into main from fix/amd-fp4-mi355x-image-revert
Apr 21, 2026
@Oseltamivir Oseltamivir commented Apr 19, 2026

Summary

  • Trigger missing H200 multinode evals from Multinode evals #1000
  • Revert dsr1-fp4-mi355x-sglang-disagg and dsr1-fp4-mi355x-sglang-disagg-mtp image from mori-0313-2 back to mori-0227-3
  • H100 multinode evals are still missing due to cluster NVSHMEM issues

Missed staging this change before merging #1000.
@github-actions

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir changed the title Revert DSR1 FP4 MI355X SGLang image to mori-0227-3 Trigger GB300 evals & revert MI355X image to mori-0227-3 Apr 19, 2026

@claude claude Bot left a comment

LGTM — simple image tag revert with no bugs found.

Extended reasoning...

Overview

Two-line change in .github/configs/amd-master.yaml reverting the Docker image tag for dsr1-fp4-mi355x-sglang-disagg and dsr1-fp4-mi355x-sglang-disagg-mtp from mori-0313-2 back to mori-0227-3. No logic, model, or search-space configuration is touched.

Security risks

None. This is a config-only image tag change with no auth, crypto, or permission implications.

Level of scrutiny

Minimal — this is a mechanical revert of a missed staging step from a prior merge, as described in the PR summary. The change is exactly two lines and purely declarative.

Other factors

No bugs were reported by the automated system. The PR timeline contains only the standard bot reminder, and there are no outstanding reviewer comments. The change pattern matches other image-tag reverts in this repo.

@Oseltamivir Oseltamivir changed the title Trigger GB300 evals & revert MI355X image to mori-0227-3 Trigger H200 multinode evals & revert MI355X image to mori-0227-3 Apr 19, 2026
Oseltamivir and others added 5 commits April 19, 2026 17:45

Same approach as the B200 launcher: overrides max_attempts to 720 in the srt-slurm config before submitting. The default of 180 (30 min) is too short for a disagg SGLang EAGLE cold start.

H200 recipes don't have a health_check section (unlike B200), so the sed replacement was a silent no-op. Now appends the block if max_attempts isn't found in the config file.
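
The replace-or-append behaviour described in these commit messages can be sketched roughly as follows. This is an illustration only, not the launcher's actual code: the commits mention a sed replacement in the launcher, whereas this sketch uses Python, and the function name `override_max_attempts` and the exact layout of the `health_check` block (a top-level key with `max_attempts` nested under it) are assumptions. If 180 attempts correspond to 30 minutes, each attempt is about 10 seconds, so 720 attempts would allow roughly 2 hours, assuming the interval is unchanged.

```python
import re


def override_max_attempts(config_text: str, max_attempts: int = 720) -> str:
    """Force max_attempts in a YAML-like config, appending a block if absent.

    Hypothetical helper: the key layout below is an assumption, not taken
    from the real srt-slurm recipes.
    """
    # Match an existing "max_attempts: <number>" line, keeping its indentation.
    pattern = re.compile(r"^([ \t]*max_attempts:[ \t]*)\d+", flags=re.MULTILINE)

    # A callable replacement avoids ambiguity between a group reference
    # and the digits that follow it.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{max_attempts}", config_text
    )
    if count > 0:
        return new_text

    # No max_attempts key found (the H200 case in the commit message):
    # append a health_check block instead of silently doing nothing.
    # Note: if a health_check section exists without max_attempts, this
    # simple sketch would produce a duplicate key.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f"health_check:\n  max_attempts: {max_attempts}\n"
```

Checking the replacement count instead of assuming the substitution succeeded is what avoids the silent no-op the second commit message describes.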
@Oseltamivir Oseltamivir merged commit d9e1a75 into main Apr 21, 2026
5 checks passed
@Oseltamivir Oseltamivir deleted the fix/amd-fp4-mi355x-image-revert branch April 21, 2026 02:18
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 21, 2026
Pulls 55 upstream commits published on SemiAnalysisAI/InferenceX:main
since PR SemiAnalysisAI#1032 was opened. Zero conflicts; none touch tools/ or
datasets/isb1/. Purpose: modernize PR base before Cam review and
absorb upstream fork-drift reductions.

Notable upstream work picked up:
- MiniMax M2.5 MXFP4 MI355X + B300 configs
- GLM5.1 FP4 MI355X support
- GPT-OSS FP4 TP=8 conc=1 extension (SemiAnalysisAI#1096)
- H200 multinode evals (SemiAnalysisAI#1000)
- B300 configs for Kimi K2.5, DSR1, Qwen3.5
- Parallel random data generation (SemiAnalysisAI#1094)
- KNOWN_LIMITATION.md updates

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@billishyahao

Hey @Oseltamivir, may I know the reason for reverting the mori image back to 0227-3?

@claude claude Bot mentioned this pull request Apr 26, 2026