[atom-vllm benchmark] remove flydsl gdr kernel for q3next TP1 benchmark case #671
zejunchen-zejun wants to merge 4 commits into main
Conversation
q3next TP1 benchmark case
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Removes the FlyDSL GDR decode kernel toggle for the Qwen3-Next-80B-A3B-Instruct-FP8 TP1 out-of-the-box benchmark configuration, so that this benchmark no longer runs with ATOM_USE_FLYDSL_GDR=1 enabled by default.
Changes:
- Drop ATOM_USE_FLYDSL_GDR=1 from the TP1 benchmark entry's env_vars for Qwen3-Next-80B-A3B-Instruct-FP8.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
        return gemma_rmsnorm_triton(
            x, self.weight.data, self.variance_epsilon, residual
        )
GemmaRMSNorm.forward_cuda now returns the raw result of gemma_rmsnorm_triton, which is always a tuple (out, residual_out|None). When residual is None, callers like hidden_states = self.input_layernorm(hidden_states) expect a Tensor, not (Tensor, None), which will break at runtime. Unwrap the return to match forward_static semantics: return out when residual is None, else return (out, residual_out).
Suggested change:
-        return gemma_rmsnorm_triton(
-            x, self.weight.data, self.variance_epsilon, residual
-        )
+        out, residual_out = gemma_rmsnorm_triton(
+            x, self.weight.data, self.variance_epsilon, residual
+        )
+        return out if residual is None else (out, residual_out)
        from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
GemmaRMSNorm.forward() always routes to forward_cuda() when not using the fused-quant path; with this change forward_cuda() unconditionally imports/uses Triton. This removes the CPU-safe fallback that previously existed (it used PyTorch ops) and will fail if x is on CPU or Triton isn't available. Consider guarding on x.is_cuda/ROCm and falling back to forward_native otherwise.
Suggested change:
-        from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+        if not x.is_cuda:
+            return self.forward_native(x, residual)
+        try:
+            from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+        except (ImportError, ModuleNotFoundError):
+            return self.forward_native(x, residual)
    Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
    kernel that fuses residual add and RMS normalization with the Gemma-style
    weight offset: out = rmsnorm(x + residual) * (1 + w).

    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.

    Two custom ops are registered so that torch.compile (Dynamo) can trace through
    them without falling back to the PyTorch implementation that contains
    x.float() / x.to(orig_dtype) dtype-cast copy kernels:
    - ``fused_gemma_rmsnorm`` (no residual)
    - ``fused_gemma_add_rmsnorm`` (with residual add)
The module docstring says “Two custom ops are registered …” (fused_gemma_rmsnorm / fused_gemma_add_rmsnorm), but this file only defines a Triton kernel + Python launcher and does not register any torch.library/custom op. Either implement the custom op registration (so torch.compile can trace it as intended) or update the docstring to accurately describe the current behavior (likely a graph break under Dynamo).
Suggested change:
-    Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
-    kernel that fuses residual add and RMS normalization with the Gemma-style
-    weight offset: out = rmsnorm(x + residual) * (1 + w).
-    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
-    Two custom ops are registered so that torch.compile (Dynamo) can trace through
-    them without falling back to the PyTorch implementation that contains
-    x.float() / x.to(orig_dtype) dtype-cast copy kernels:
-    - ``fused_gemma_rmsnorm`` (no residual)
-    - ``fused_gemma_add_rmsnorm`` (with residual add)
+    Provides a Triton kernel and Python launcher for GemmaRMSNorm with optional
+    residual add, fusing residual addition and RMS normalization with the
+    Gemma-style weight offset: out = rmsnorm(x + residual) * (1 + w).
+    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
+    This module does not register ``torch.library`` custom ops; it only exposes
+    the Triton kernel and the ``gemma_rmsnorm_triton`` launcher defined below.
+    As a result, ``torch.compile``/Dynamo will not see custom-op registrations
+    from this file alone.
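If the intent is instead to keep Dynamo traceability, a registration along these lines could be added. This is a minimal sketch assuming PyTorch 2.4+ and its torch.library.custom_op decorator; the op name atom::fused_gemma_rmsnorm and the wrapper signature are illustrative, not the module's actual API, and the with-residual variant would be registered analogously:

    # Hypothetical sketch (not in this PR): wrap the Triton launcher in a
    # torch.library custom op so torch.compile can trace through it.
    import torch
    from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton

    @torch.library.custom_op("atom::fused_gemma_rmsnorm", mutates_args=())
    def fused_gemma_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
        # No-residual path: launcher returns (out, residual_out|None).
        out, _ = gemma_rmsnorm_triton(x, weight, eps, None)
        return out

    @fused_gemma_rmsnorm.register_fake
    def _(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
        # Fake (meta) implementation: shape/dtype propagation only, no compute.
        return torch.empty_like(x)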
        x = x + res
        # Store residual_out (needed by next layer)
        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)

    x = x.to(tl.float32)
In the residual path, the kernel adds x = x + res before promoting to fp32 (x = x.to(tl.float32)), so fp16 inputs accumulate the residual in fp16 at lower precision, unlike GemmaRMSNorm.forward_static, which explicitly performs the addition in fp32 for fp16 inputs. To preserve numerical behavior, cast x and res to fp32 before the add (and only cast once).
Suggested change:
-        x = x + res
-        # Store residual_out (needed by next layer)
-        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
-        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
-        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
-    x = x.to(tl.float32)
+        x = x.to(tl.float32) + res.to(tl.float32)
+        # Store residual_out (needed by next layer)
+        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
+        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
+        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
+    else:
+        x = x.to(tl.float32)
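To see why the cast order matters, here is a small, self-contained illustration (not from the PR) of fp16 addition losing the residual at large magnitudes:

    # Standalone illustration: adding in fp16 first vs promoting to fp32 first.
    import torch

    x = torch.full((1,), 2048.0, dtype=torch.float16)
    r = torch.full((1,), 0.3, dtype=torch.float16)

    print((x + r).float())        # tensor([2048.]) — 0.3 is below fp16 spacing at 2048
    print(x.float() + r.float())  # tensor([2048.2998]) — fp32 add keeps the residual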
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
This shallow clone (--depth 1) is incompatible with checking out an arbitrary ${AITER_COMMIT}: if the commit isn’t the repo’s current HEAD, git checkout will fail because the object isn’t in the shallow history. Remove --depth 1 or fetch the specific commit/branch before checkout (e.g., git fetch --depth 1 origin <sha>).
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    pip install -r requirements.txt && \\
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\
pip install -r requirements.txt is run before git checkout "${AITER_COMMIT}", so dependencies are installed from whatever default-branch revision the shallow clone initially pulled rather than the selected commit. Checkout (and submodules) should happen before installing requirements to keep the environment consistent with the chosen AITER revision.
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\
| """ | ||
| Forward pass with three parts: | ||
| 1. Input projection | ||
| 2. Core attention (custom op) |
The docstring says “Forward pass with three parts” but the enumerated list now only includes parts 1 and 2, while the code still has a “Part 3: Output Projection” section below. Update the docstring list to include the third part again (or adjust wording) to keep documentation consistent.
Suggested change:
-    2. Core attention (custom op)
+    2. Core attention (custom op)
+    3. Output projection
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    NUM_PRGMS = min(n_rows, 304)  # MI355X has 304 CUs
NUM_PRGMS = min(n_rows, 304) hard-codes MI355X’s CU count, which may under/over-subscribe other GPUs and makes performance tuning device-specific in a generic op. Consider deriving this from device properties (if available) or making it a configurable heuristic keyed off the active GPU arch.
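A minimal sketch of one such heuristic, assuming a CUDA/ROCm device is active (the helper name launch_params is illustrative, not part of the PR):

    # Hypothetical helper: derive the grid size from the active device's
    # CU/SM count instead of hard-coding MI355X's 304 CUs.
    import torch
    import triton

    def launch_params(n_rows: int, n_cols: int) -> tuple[int, int]:
        props = torch.cuda.get_device_properties(torch.cuda.current_device())
        num_cus = props.multi_processor_count  # CUs on ROCm, SMs on CUDA
        block_size = triton.next_power_of_2(n_cols)
        num_prgms = min(n_rows, num_cus)
        return block_size, num_prgms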
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\
The AITER install in the generated Dockerfile uses git clone --depth 1 ... and then git checkout "${AITER_COMMIT}". If ${AITER_COMMIT} is not the repo’s default-branch HEAD, the checkout will fail in a shallow clone. Consider cloning without --depth 1, or fetching the specific commit (git fetch --depth 1 origin ${AITER_COMMIT}) before checkout; also, install requirements.txt after checking out the target commit so dependencies match that revision.
Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\
      && needs.build-benchmark-matrix.result == 'success'
      && needs.build-benchmark-matrix.outputs.has_benchmark_cells == 'true'
-   runs-on: build-only-atom
+   runs-on: atom-mi355-8gpu-oot-benchmark
runs-on was changed to atom-mi355-8gpu-oot-benchmark, but this runner label is not present in .github/runner-config.yml (which is used for dashboard GPU arch/count metadata and explicitly says it must be updated when runners change). Please add a mapping entry for atom-mi355-8gpu-oot-benchmark in .github/runner-config.yml to keep CI metadata consistent.
Force-pushed 92fc988 to 976f325
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
No description provided.