[atom-vllm benchmark] remove flydsl gdr kernel for q3next TP1 benchmark case #671
zejunchen-zejun wants to merge 4 commits into main
Conversation
q3next TP1 benchmark case
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Removes the FlyDSL GDR decode kernel toggle for the Qwen3-Next-80B-A3B-Instruct-FP8 TP1 out-of-the-box benchmark configuration, so that this benchmark no longer runs with ATOM_USE_FLYDSL_GDR=1 enabled by default.
Changes:
- Drop ATOM_USE_FLYDSL_GDR=1 from the TP1 benchmark entry's env_vars for Qwen3-Next-80B-A3B-Instruct-FP8.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
        return gemma_rmsnorm_triton(
            x, self.weight.data, self.variance_epsilon, residual
        )
GemmaRMSNorm.forward_cuda now returns the raw result of gemma_rmsnorm_triton, which is always a tuple (out, residual_out|None). When residual is None, callers like hidden_states = self.input_layernorm(hidden_states) expect a Tensor, not (Tensor, None), which will break at runtime. Unwrap the return to match forward_static semantics: return out when residual is None, else return (out, residual_out).
Suggested change:
-        return gemma_rmsnorm_triton(
-            x, self.weight.data, self.variance_epsilon, residual
-        )
+        out, residual_out = gemma_rmsnorm_triton(
+            x, self.weight.data, self.variance_epsilon, residual
+        )
+        return out if residual is None else (out, residual_out)
        from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
GemmaRMSNorm.forward() always routes to forward_cuda() when not using the fused-quant path; with this change forward_cuda() unconditionally imports/uses Triton. This removes the CPU-safe fallback that previously existed (it used PyTorch ops) and will fail if x is on CPU or Triton isn't available. Consider guarding on x.is_cuda/ROCm and falling back to forward_native otherwise.
Suggested change:
-        from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+        if not x.is_cuda:
+            return self.forward_native(x, residual)
+        try:
+            from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+        except (ImportError, ModuleNotFoundError):
+            return self.forward_native(x, residual)
    Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
    kernel that fuses residual add and RMS normalization with the Gemma-style
    weight offset: out = rmsnorm(x + residual) * (1 + w).

    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.

    Two custom ops are registered so that torch.compile (Dynamo) can trace through
    them without falling back to the PyTorch implementation that contains
    x.float() / x.to(orig_dtype) dtype-cast copy kernels:
    - ``fused_gemma_rmsnorm`` (no residual)
    - ``fused_gemma_add_rmsnorm`` (with residual add)
The module docstring says “Two custom ops are registered …” (fused_gemma_rmsnorm / fused_gemma_add_rmsnorm), but this file only defines a Triton kernel + Python launcher and does not register any torch.library/custom op. Either implement the custom op registration (so torch.compile can trace it as intended) or update the docstring to accurately describe the current behavior (likely a graph break under Dynamo).
Suggested change:
-    Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
-    kernel that fuses residual add and RMS normalization with the Gemma-style
-    weight offset: out = rmsnorm(x + residual) * (1 + w).
-    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
-    Two custom ops are registered so that torch.compile (Dynamo) can trace through
-    them without falling back to the PyTorch implementation that contains
-    x.float() / x.to(orig_dtype) dtype-cast copy kernels:
-    - ``fused_gemma_rmsnorm`` (no residual)
-    - ``fused_gemma_add_rmsnorm`` (with residual add)
+    Provides a Triton kernel and Python launcher for GemmaRMSNorm with optional
+    residual add, fusing residual addition and RMS normalization with the
+    Gemma-style weight offset: out = rmsnorm(x + residual) * (1 + w).
+    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
+    This module does not register ``torch.library`` custom ops; it only exposes
+    the Triton kernel and the ``gemma_rmsnorm_triton`` launcher defined below.
+    As a result, ``torch.compile``/Dynamo will not see custom-op registrations
+    from this file alone.
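If the intent is instead to keep Dynamo traceability, a registration along these lines could be added. This is a minimal sketch assuming PyTorch 2.4+ and its torch.library.custom_op decorator; the op name atom::fused_gemma_rmsnorm and the wrapper signature are illustrative, not the module's actual API, and the with-residual variant would be registered analogously:

    # Hypothetical sketch (not in this PR): wrap the Triton launcher in a
    # torch.library custom op so torch.compile can trace through it.
    import torch
    from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton

    @torch.library.custom_op("atom::fused_gemma_rmsnorm", mutates_args=())
    def fused_gemma_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
        # No-residual path: launcher returns (out, residual_out|None).
        out, _ = gemma_rmsnorm_triton(x, weight, eps, None)
        return out

    @fused_gemma_rmsnorm.register_fake
    def _(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
        # Fake (meta) implementation: shape/dtype propagation only, no compute.
        return torch.empty_like(x)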
        x = x + res
        # Store residual_out (needed by next layer)
        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)

    x = x.to(tl.float32)
In the residual path, the kernel adds x = x + res before promoting to fp32 (x = x.to(tl.float32)), so fp16 inputs accumulate the residual in fp16 at lower precision, unlike GemmaRMSNorm.forward_static, which explicitly performs the addition in fp32 for fp16 inputs. To preserve numerical behavior, cast x and res to fp32 before the add (and only cast once).
Suggested change:
-        x = x + res
-        # Store residual_out (needed by next layer)
-        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
-        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
-        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
-    x = x.to(tl.float32)
+        x = x.to(tl.float32) + res.to(tl.float32)
+        # Store residual_out (needed by next layer)
+        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
+        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
+        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
+    else:
+        x = x.to(tl.float32)
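To see why the cast order matters, here is a small, self-contained illustration (not from the PR) of fp16 addition losing the residual at large magnitudes:

    # Standalone illustration: adding in fp16 first vs promoting to fp32 first.
    import torch

    x = torch.full((1,), 2048.0, dtype=torch.float16)
    r = torch.full((1,), 0.3, dtype=torch.float16)

    print((x + r).float())        # tensor([2048.]) — 0.3 is below fp16 spacing at 2048
    print(x.float() + r.float())  # tensor([2048.2998]) — fp32 add keeps the residual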
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
This shallow clone (--depth 1) is incompatible with checking out an arbitrary ${AITER_COMMIT}: if the commit isn’t the repo’s current HEAD, git checkout will fail because the object isn’t in the shallow history. Remove --depth 1 or fetch the specific commit/branch before checkout (e.g., git fetch --depth 1 origin <sha>).
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    pip install -r requirements.txt && \\
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\
pip install -r requirements.txt is run before git checkout "${AITER_COMMIT}", so dependencies are installed from whatever default-branch revision the shallow clone initially pulled rather than the selected commit. Checkout (and submodules) should happen before installing requirements to keep the environment consistent with the chosen AITER revision.
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\
| """ | ||
| Forward pass with three parts: | ||
| 1. Input projection | ||
| 2. Core attention (custom op) |
The docstring says “Forward pass with three parts” but the enumerated list now only includes parts 1 and 2, while the code still has a “Part 3: Output Projection” section below. Update the docstring list to include the third part again (or adjust wording) to keep documentation consistent.
Suggested change:
-    2. Core attention (custom op)
+    2. Core attention (custom op)
+    3. Output projection
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    NUM_PRGMS = min(n_rows, 304)  # MI355X has 304 CUs
NUM_PRGMS = min(n_rows, 304) hard-codes MI355X’s CU count, which may under/over-subscribe other GPUs and makes performance tuning device-specific in a generic op. Consider deriving this from device properties (if available) or making it a configurable heuristic keyed off the active GPU arch.
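A minimal sketch of one such heuristic, assuming a CUDA/ROCm device is active (the helper name launch_params is illustrative, not part of the PR):

    # Hypothetical helper: derive the grid size from the active device's
    # CU/SM count instead of hard-coding MI355X's 304 CUs.
    import torch
    import triton

    def launch_params(n_rows: int, n_cols: int) -> tuple[int, int]:
        props = torch.cuda.get_device_properties(torch.cuda.current_device())
        num_cus = props.multi_processor_count  # CUs on ROCm, SMs on CUDA
        block_size = triton.next_power_of_2(n_cols)
        num_prgms = min(n_rows, num_cus)
        return block_size, num_prgms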
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\
The AITER install in the generated Dockerfile uses git clone --depth 1 ... and then git checkout "${AITER_COMMIT}". If ${AITER_COMMIT} is not the repo’s default-branch HEAD, the checkout will fail in a shallow clone. Consider cloning without --depth 1, or fetching the specific commit (git fetch --depth 1 origin ${AITER_COMMIT}) before checkout; also, install requirements.txt after checking out the target commit so dependencies match that revision.
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\
      && needs.build-benchmark-matrix.result == 'success'
      && needs.build-benchmark-matrix.outputs.has_benchmark_cells == 'true'
-   runs-on: build-only-atom
+   runs-on: atom-mi355-8gpu-oot-benchmark
runs-on was changed to atom-mi355-8gpu-oot-benchmark, but this runner label is not present in .github/runner-config.yml (which is used for dashboard GPU arch/count metadata and explicitly says it must be updated when runners change). Please add a mapping entry for atom-mi355-8gpu-oot-benchmark in .github/runner-config.yml to keep CI metadata consistent.
Force-pushed 92fc988 to 976f325
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
No description provided.