
[atom-vllm benchmark] remove flydsl gdr kernel for q3next TP1 benchmark case #671

Draft

zejunchen-zejun wants to merge 4 commits into main from zejun/disable_flydsl_gdn_for_q3next_tp1_0429

Conversation

@zejunchen-zejun
Collaborator

No description provided.

q3next TP1 benchmark case

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings April 29, 2026 12:00
Copilot AI (Contributor) left a comment


Pull request overview

Removes the FlyDSL GDR decode kernel toggle for the Qwen3-Next-80B-A3B-Instruct-FP8 TP1 out-of-the-box benchmark configuration, so that this benchmark runs without ATOM_USE_FLYDSL_GDR=1 enabled by default.

Changes:

  • Drop ATOM_USE_FLYDSL_GDR=1 from the TP1 benchmark entry’s env_vars for Qwen3-Next-80B-A3B-Instruct-FP8.


Copilot AI review requested due to automatic review settings April 29, 2026 13:01
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.



Comment thread: atom/model_ops/layernorm.py (outdated)
Comment on lines +552 to +554

    return gemma_rmsnorm_triton(
        x, self.weight.data, self.variance_epsilon, residual
    )

Copilot AI Apr 29, 2026


GemmaRMSNorm.forward_cuda now returns the raw result of gemma_rmsnorm_triton, which is always a tuple (out, residual_out|None). When residual is None, callers like hidden_states = self.input_layernorm(hidden_states) expect a Tensor, not (Tensor, None), which will break at runtime. Unwrap the return to match forward_static semantics: return out when residual is None, else return (out, residual_out).

Suggested change:
-    return gemma_rmsnorm_triton(
-        x, self.weight.data, self.variance_epsilon, residual
-    )
+    out, residual_out = gemma_rmsnorm_triton(
+        x, self.weight.data, self.variance_epsilon, residual
+    )
+    return out if residual is None else (out, residual_out)

Comment thread: atom/model_ops/layernorm.py (outdated)
Comment on lines 550 to 551

    from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton

Copilot AI Apr 29, 2026


GemmaRMSNorm.forward() always routes to forward_cuda() when not using the fused-quant path; with this change forward_cuda() unconditionally imports/uses Triton. This removes the CPU-safe fallback that previously existed (it used PyTorch ops) and will fail if x is on CPU or Triton isn't available. Consider guarding on x.is_cuda/ROCm and falling back to forward_native otherwise.

Suggested change:
-    from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+    if not x.is_cuda:
+        return self.forward_native(x, residual)
+    try:
+        from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton
+    except (ImportError, ModuleNotFoundError):
+        return self.forward_native(x, residual)

Comment thread: atom/model_ops/triton_gemma_rmsnorm.py (outdated)
Comment on lines +3 to +13

    Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
    kernel that fuses residual add and RMS normalization with the Gemma-style
    weight offset: out = rmsnorm(x + residual) * (1 + w).

    Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.

    Two custom ops are registered so that torch.compile (Dynamo) can trace through
    them without falling back to the PyTorch implementation that contains
    x.float() / x.to(orig_dtype) dtype-cast copy kernels:
    - ``fused_gemma_rmsnorm`` (no residual)
    - ``fused_gemma_add_rmsnorm`` (with residual add)

Copilot AI Apr 29, 2026


The module docstring says “Two custom ops are registered …” (fused_gemma_rmsnorm / fused_gemma_add_rmsnorm), but this file only defines a Triton kernel + Python launcher and does not register any torch.library/custom op. Either implement the custom op registration (so torch.compile can trace it as intended) or update the docstring to accurately describe the current behavior (likely a graph break under Dynamo).

Suggested change:
-Replaces the torch.compile'd GemmaRMSNorm.forward_static with a single Triton
-kernel that fuses residual add and RMS normalization with the Gemma-style
-weight offset: out = rmsnorm(x + residual) * (1 + w).
-Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
-Two custom ops are registered so that torch.compile (Dynamo) can trace through
-them without falling back to the PyTorch implementation that contains
-x.float() / x.to(orig_dtype) dtype-cast copy kernels:
-- ``fused_gemma_rmsnorm`` (no residual)
-- ``fused_gemma_add_rmsnorm`` (with residual add)
+Provides a Triton kernel and Python launcher for GemmaRMSNorm with optional
+residual add, fusing residual addition and RMS normalization with the
+Gemma-style weight offset: out = rmsnorm(x + residual) * (1 + w).
+Based on aiter's _fused_add_rmsnorm_kernel with the (g + 1.0) Gemma offset.
+This module does not register ``torch.library`` custom ops; it only exposes
+the Triton kernel and the ``gemma_rmsnorm_triton`` launcher defined below.
+As a result, ``torch.compile``/Dynamo will not see custom-op registrations
+from this file alone.

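For the first option, a minimal sketch of what the registration could look like for the no-residual op, assuming the gemma_rmsnorm_triton(x, weight, eps, residual) launcher signature described in this PR; the atom:: namespace and the fake-impl shape are assumptions, not code from this change:

    import torch
    from atom.model_ops.triton_gemma_rmsnorm import gemma_rmsnorm_triton

    # Hypothetical registration (requires PyTorch >= 2.4); the "atom" namespace
    # is an assumption, not part of this PR.
    @torch.library.custom_op("atom::fused_gemma_rmsnorm", mutates_args=())
    def fused_gemma_rmsnorm(
        x: torch.Tensor, weight: torch.Tensor, eps: float
    ) -> torch.Tensor:
        out, _ = gemma_rmsnorm_triton(x, weight, eps, None)
        return out

    @fused_gemma_rmsnorm.register_fake
    def _(x, weight, eps):
        # Dynamo/meta tracing only needs output metadata: same shape/dtype as x.
        return torch.empty_like(x)

The residual variant would be registered the same way, returning both out and residual_out.
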
Comment thread: atom/model_ops/triton_gemma_rmsnorm.py (outdated)
Comment on lines +61 to +68

        x = x + res
        # Store residual_out (needed by next layer)
        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)

    x = x.to(tl.float32)

Copilot AI Apr 29, 2026


In the residual path, the kernel adds x = x + res before promoting to fp32 (x = x.to(tl.float32)), which means fp16 inputs accumulate the residual in fp16 (lower precision), unlike GemmaRMSNorm.forward_static, which explicitly performs the addition in fp32 for fp16 inputs. To preserve numerical behavior, cast x and res to fp32 before the add (and only cast once).

Suggested change:
-        x = x + res
-        # Store residual_out (needed by next layer)
-        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
-        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
-        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
-    x = x.to(tl.float32)
+        x = x.to(tl.float32) + res.to(tl.float32)
+        # Store residual_out (needed by next layer)
+        res_out_ptrs = res_out_ptr + row_idx * input_row_stride + col_offsets
+        res_out_ptrs = tl.multiple_of(res_out_ptrs, (16,))
+        tl.store(res_out_ptrs, x.to(res_out_ptr.dtype.element_ty), mask=mask)
+    else:
+        x = x.to(tl.float32)
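
The precision concern is easy to reproduce outside the kernel; a small PyTorch illustration (not from this PR) of why adding in fp16 before the fp32 promotion loses information:

    import torch

    x = torch.tensor([2048.0], dtype=torch.float16)
    res = torch.tensor([1.0], dtype=torch.float16)
    # fp16 spacing at 2048 is 2, so the +1 is rounded away entirely.
    print((x + res).float().item())            # 2048.0
    print((x.float() + res.float()).item())    # 2049.0
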
Comment on lines +530 to +531

    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\

Copilot AI Apr 29, 2026


This shallow clone (--depth 1) is incompatible with checking out an arbitrary ${AITER_COMMIT}: if the commit isn’t the repo’s current HEAD, git checkout will fail because the object isn’t in the shallow history. Remove --depth 1 or fetch the specific commit/branch before checkout (e.g., git fetch --depth 1 origin <sha>).

Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    pip install -r requirements.txt && \\

Comment on lines 530 to 532

    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\

Copilot AI Apr 29, 2026


pip install -r requirements.txt is run before git checkout "${AITER_COMMIT}", so dependencies are installed from whatever default-branch revision the shallow clone initially pulled rather than the selected commit. Checkout (and submodules) should happen before installing requirements to keep the environment consistent with the chosen AITER revision.

Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\

Comment thread: atom/models/qwen3_5.py

    """
    Forward pass with three parts:
    1. Input projection
    2. Core attention (custom op)

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says “Forward pass with three parts” but the enumerated list now only includes parts 1 and 2, while the code still has a “Part 3: Output Projection” section below. Update the docstring list to include the third part again (or adjust wording) to keep documentation consistent.

Suggested change:
-    2. Core attention (custom op)
+    2. Core attention (custom op)
+    3. Output projection

Comment thread: atom/model_ops/triton_gemma_rmsnorm.py (outdated)
Comment on lines +98 to +100

    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    NUM_PRGMS = min(n_rows, 304)  # MI355X has 304 CUs

Copilot AI Apr 29, 2026


NUM_PRGMS = min(n_rows, 304) hard-codes MI355X’s CU count, which may under/over-subscribe other GPUs and makes performance tuning device-specific in a generic op. Consider deriving this from device properties (if available) or making it a configurable heuristic keyed off the active GPU arch.

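One way to derive it from device properties, sketched under the assumption that torch is already a dependency (the helper name and the 304 fallback are illustrative, not from this PR):

    import torch

    def default_num_programs(n_rows: int, fallback_cus: int = 304) -> int:
        # Query the active GPU's CU/SM count (multi_processor_count maps to
        # compute units on ROCm) instead of hard-coding MI355X's 304 CUs.
        if torch.cuda.is_available():
            cus = torch.cuda.get_device_properties(
                torch.cuda.current_device()
            ).multi_processor_count
        else:
            cus = fallback_cus
        return min(n_rows, cus)

    # NUM_PRGMS = default_num_programs(n_rows)
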
Copilot AI review requested due to automatic review settings April 29, 2026 14:26
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines 566 to 568

    pip install -r requirements.txt && \\
    git checkout "${AITER_COMMIT}" && \\
    git submodule sync && git submodule update --init --recursive && \\

Copilot AI Apr 29, 2026


The AITER install in the generated Dockerfile uses git clone --depth 1 ... and then git checkout "${AITER_COMMIT}". If ${AITER_COMMIT} is not the repo’s default-branch HEAD, the checkout will fail in a shallow clone. Consider cloning without --depth 1, or fetching the specific commit (git fetch --depth 1 origin ${AITER_COMMIT}) before checkout; also, install requirements.txt after checking out the target commit so dependencies match that revision.

Suggested change:
-    pip install -r requirements.txt && \\
-    git checkout "${AITER_COMMIT}" && \\
-    git submodule sync && git submodule update --init --recursive && \\
+    git fetch --depth 1 origin "${AITER_COMMIT}" && \\
+    git checkout "${AITER_COMMIT}" && \\
+    git submodule sync && git submodule update --init --recursive && \\
+    pip install -r requirements.txt && \\

Comment thread (CI workflow file)

        && needs.build-benchmark-matrix.result == 'success'
        && needs.build-benchmark-matrix.outputs.has_benchmark_cells == 'true'
-    runs-on: build-only-atom
+    runs-on: atom-mi355-8gpu-oot-benchmark

Copilot AI Apr 29, 2026


runs-on was changed to atom-mi355-8gpu-oot-benchmark, but this runner label is not present in .github/runner-config.yml (which is used for dashboard GPU arch/count metadata and explicitly says it must be updated when runners change). Please add a mapping entry for atom-mi355-8gpu-oot-benchmark in .github/runner-config.yml to keep CI metadata consistent.

zejunchen-zejun force-pushed the zejun/disable_flydsl_gdn_for_q3next_tp1_0429 branch from 92fc988 to 976f325 on April 30, 2026 00:50
Copilot AI review requested due to automatic review settings April 30, 2026 00:50
zejunchen-zejun marked this pull request as draft April 30, 2026 00:51
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.



Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
