Conversation
Pull Request Overview
This PR introduces automated validation of critical-path configurations through GitHub PR labels. When developers add a label matching `<RUNNER_TYPE>_<MODEL_PREFIX>` (e.g., `b200_gptoss`), it automatically triggers validation workflows for the affected configuration.
Key Changes:
- Adds a `label-validation.yml` workflow that triggers on PR label events
- Parses labels to extract the runner type and model prefix, then generates the appropriate test configurations
- Includes temporary GB200 workaround pending resolution of multi-node testing architecture issue
functionstackx left a comment:
- missing collect-results
Design Doc: Auto PR Validation via Labels
Validate all labels: https://github.com/InferenceMAX/InferenceMAX/actions/runs/19142957749/job/54712862609
Introduction
Currently, when a change is made to a critical path for a particular configuration, users must manually run workflows to validate the change. In an effort to move towards general ease of use within the InferenceMAX environment, we propose adding validation of particular configs (e.g., `gptoss-h200`) via pull request labels.
Implementation
Since PR #145, developers have more control over which jobs run, thanks to external master config files and a Python reconciliation script. This architecture lends itself nicely to having a CI label for auto validation of configurations. Below we propose how this will work.
When a user opens a PR, they will still be responsible for identifying which configuration their changes affect. Once they do, however, they will be able to add a label to the PR of the form `<RUNNER_TYPE>_<MODEL_PREFIX>`. This will invoke a new workflow, `label-validation.yml`, which runs as follows.

First, a job parses the label and decides whether it matches the expected format `<RUNNER_TYPE>_<MODEL_PREFIX>`. Note that this validation performs only a simple regex match for two strings separated by a `_` delimiter. We considered validating the label further by loading the `.github/configs` files and checking that `RUNNER_TYPE` and `MODEL_PREFIX` actually exist and are valid, but this would add complexity to the workflow file that we deem unnecessary for now; it can always be added later. Upon a match, we save the pattern groups as the `runner-type` and `model-prefix` job outputs, which are passed to the next job, `get-jobs`. This job invokes `python3 ${GITHUB_WORKSPACE}/utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix MODEL_PREFIX --runner-type RUNNER_TYPE --seq-lens 1k1k --test-mode --config-files ...` and saves the output to `GITHUB_OUTPUT` for use in the next job, which actually generates the matrix and runs the appropriate jobs.

As a quick refresher, this command finds all the configs that match `RUNNER_TYPE`, `MODEL_PREFIX`, and sequence length `1k1k`, and runs only the highest TP level with the lowest available CONC (as indicated by `--test-mode`).
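To make the first step concrete, here is a minimal sketch of the trigger and the parse job described above. This is illustrative only: the exact job and step names, the regex, and the output wiring are assumptions, not the actual workflow contents.

```yaml
# Illustrative sketch of label-validation.yml; names and details are assumptions.
name: label-validation

on:
  pull_request:
    types: [labeled]

jobs:
  parse-label:
    runs-on: ubuntu-latest
    outputs:
      runner-type: ${{ steps.parse.outputs.runner-type }}
      model-prefix: ${{ steps.parse.outputs.model-prefix }}
    steps:
      - name: Parse label into runner type and model prefix
        id: parse
        env:
          LABEL: ${{ github.event.label.name }}
        run: |
          # Simple regex match: two strings separated by a "_" delimiter.
          # On no match, the outputs stay empty and downstream jobs are skipped.
          if [[ "$LABEL" =~ ^([a-z0-9]+)_([a-z0-9]+)$ ]]; then
            echo "runner-type=${BASH_REMATCH[1]}" >> "$GITHUB_OUTPUT"
            echo "model-prefix=${BASH_REMATCH[2]}" >> "$GITHUB_OUTPUT"
          fi
```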
For instance, suppose a developer makes a change to `benchmarks/gptoss_fp4_b200_docker.sh` (a critical path). They would then add the label `b200_gptoss`. This triggers the `label-validation.yml` workflow, which first runs the `parse-label` job; the regex pattern matches, extracting `RUNNER_TYPE` and `MODEL_PREFIX` into the corresponding job output variables. The next job runs only if those output variables are non-empty, and invokes the full-sweep command with the appropriate `--runner-type` and `--model-prefix`. Upon success, the `validate` job runs, calling the actual `benchmark-tmpl.yml` workflow with the parameters generated by the previous job. Finally, the `calc-success-rate` job runs (per usual) at the end of the workflow to calculate the per-GPU success rate. A sketch of this job chain follows.
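Again, this is a hedged sketch rather than the real workflow: the `--config-files` list is elided here just as it is above, and the `benchmark-tmpl.yml` path and input names are assumed.

```yaml
  # Continuation of the sketch above; job names and inputs are assumptions.
  get-jobs:
    needs: parse-label
    # Run only if the label matched the expected pattern.
    if: needs.parse-label.outputs.runner-type != '' && needs.parse-label.outputs.model-prefix != ''
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.gen.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      - name: Generate test configurations
        id: gen
        run: |
          # --config-files list elided here, as in the command above.
          matrix=$(python3 "${GITHUB_WORKSPACE}/utils/matrix-logic/generate_sweep_configs.py" full-sweep \
            --model-prefix "${{ needs.parse-label.outputs.model-prefix }}" \
            --runner-type "${{ needs.parse-label.outputs.runner-type }}" \
            --seq-lens 1k1k --test-mode --config-files ...)
          echo "matrix=${matrix}" >> "$GITHUB_OUTPUT"

  validate:
    needs: get-jobs
    uses: ./.github/workflows/benchmark-tmpl.yml  # path assumed
    with:
      matrix: ${{ needs.get-jobs.outputs.matrix }}  # input name assumed

  calc-success-rate:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - name: Calculate per-GPU success rate
        run: echo "per-GPU success rate calculation goes here"  # placeholder
```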
Labels to Add

The following labels will be added, with runners in accordance with the keys in `.github/configs/runners.yaml`:
GB200 Integration
Currently, the way GB200 (and all multi-node tests) runs is very different from the single-node configurations. This is not desirable and should be fixed ASAP (see issue https://github.com/InferenceMAX/InferenceMAX/issues/171).
Therefore, we will skip integrating GB200 with label auto-validation (for now). This is fine since there is a separate `gb200-tests.yml` workflow.

Considerations
Hard-Coding Config Files
One possible point of concern is that the master config files, along with the runner config file, are hardcoded in the `get-jobs` job. This is potentially problematic, as the workflow file will need to be updated if and when additional config files are added in the future, as sketched below.
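To illustrate the concern, the hardcoded invocation looks something like the following fragment; the config file names here are purely hypothetical, since the real list is elided in this doc.

```yaml
# Hypothetical fragment of the get-jobs job; config file names are made up.
- name: Generate test configurations
  run: |
    python3 "${GITHUB_WORKSPACE}/utils/matrix-logic/generate_sweep_configs.py" full-sweep \
      --model-prefix "${{ needs.parse-label.outputs.model-prefix }}" \
      --runner-type "${{ needs.parse-label.outputs.runner-type }}" \
      --seq-lens 1k1k --test-mode \
      --config-files .github/configs/example-amd.yaml .github/configs/example-nvidia.yaml
    # Any master config file added in the future must also be appended to the
    # --config-files list above, or its configurations will never be validated.
```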
Insufficient Coverage
Recall that the invoked `generate_sweep_configs.py` command applies the `--test-mode` argument, which runs only the highest-TP, lowest-concurrency run for the specified configuration. This can be problematic if, say, a developer makes a change that only affects the TP4 configurations, and the label only runs the TP8 configurations.

This is likely fine, because in this case the developer can just manually run the `e2e-tests.yml` workflow and be more specific about what they would like to run.

Too Much Coverage
Contrary to the point above, the label (being quite unspecific) may unnecessarily run certain configs. For instance, in the case of models with multiple precisions, such as DeepSeek, adding the label `h200_dsr1` will run both FP4 and FP8, even if only the FP8 critical path needs to be tested.
Again, this is probably fine for now because the label is still net-net more convenient than manually running the workflows.
Providing an Incorrect Label
If an invalid label is provided, the workflow will fail, since the `generate_sweep_configs.py` script that is ultimately invoked validates the runner-type argument. Here is an example: https://github.com/InferenceMAX/InferenceMAX/actions/runs/19117241089/job/54629331775.

Security Considerations
First and foremost, only users with at least write permissions on the repo can directly add a label to a PR. This eliminates the case of a random contributor being able to trigger a workflow run on their PR from a fork.
Furthermore, we tested the following path: secrets such as `HF_TOKEN` are not passed to workflows running on a PR from a fork. From our contact at GitHub: "When running a workflow for a PR coming from a fork, the secrets are not available. It runs in the context of the fork, not the main repo."

This is fine, since most PRs that will require validation will come from contributors who already have write permissions and are opening PRs in the context of the actual InferenceMAX repo, not a fork.