Fix missing ReLU in GLM-MOE-DSA indexer scoring#44690

Closed
gambletan wants to merge 1 commit into huggingface:main from gambletan:fix/glm-moe-dsa-relu

Conversation

@gambletan

Summary

Fixes #44360

The GlmMoeDsaIndexer is missing a ReLU activation on the per-head dot-product scores before the weighted sum across heads. The reference DeepSeek V3.2 implementation applies ReLU inside the fp8_index kernel:

# Reference: inference/kernel.py – fp8_index_kernel
logits[i3_n, i_h] = T.max(logits[i3_n, i_h], 0) * q_s_frag[i_h]

The computation flow in the kernel is:

  1. logits = q @ k^T (per-head dot products)
  2. logits = relu(logits) * weights (ReLU, then multiply by head weights)
  3. index_score = sum_h(logits) (reduce across heads)

The HF bf16 equivalent was missing step 2's ReLU. Without it, negative attention scores incorrectly contribute to index scoring, which can affect top-k token selection for sparse attention.
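The three-step flow above can be sketched in numpy (shapes and variable names are illustrative, not the actual HF implementation):

```python
import numpy as np

# Illustrative shapes: batch=1, queries=2, heads=4, head_dim=8, cached keys=5
b, s, h, d, t = 1, 2, 4, 8, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((b, s, h, d))
k_cached = rng.standard_normal((b, t, d))
weights = rng.standard_normal((b, s, h))
softmax_scale = d ** -0.5

# Step 1: per-head dot products, q @ k^T
logits = np.einsum("bshd,btd->bsht", q, k_cached) * softmax_scale
# Step 2: ReLU before weighting -- the step the HF port was missing
logits = np.maximum(logits, 0.0)
# Step 3: weight each head and reduce across heads to one score per (query, key)
index_scores = np.einsum("bsht,bsh->bst", logits, weights)
print(index_scores.shape)  # (1, 2, 5)
```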

Change

Added torch.nn.functional.relu(scores) after the per-head q·k^T computation and before the weighted sum, in both modular_glm_moe_dsa.py and modeling_glm_moe_dsa.py:

  scores = torch.einsum("bshd,btd->bsht", q.float(), k_cached.float()) * self.softmax_scale
+
+ # ReLU matches the reference fp8_index kernel: T.max(logits, 0) before weighting
+ scores = torch.nn.functional.relu(scores)
+
  index_scores = torch.einsum("bsht,bsh->bst", scores, weights)
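A tiny numerical example (hypothetical values) shows how the missing ReLU can flip the top-k selection: one strongly negative head score drags a key's total below other candidates, even though its positive head score should make it a strong match.

```python
import numpy as np

# Hypothetical per-head scores for one query against 3 candidate keys, 2 heads.
scores = np.array([[2.0, -5.0],   # key 0: one strongly negative head
                   [1.0,  1.0],   # key 1
                   [0.5,  0.5]])  # key 2
weights = np.array([1.0, 1.0])    # positive head weights for simplicity

without_relu = scores @ weights               # [-3.0, 2.0, 1.0]
with_relu = np.maximum(scores, 0) @ weights   # [ 2.0, 2.0, 1.0]

k = 2
top_without = set(np.argsort(without_relu)[-k:])  # key 0 is dropped
top_with = set(np.argsort(with_relu)[-k:])        # key 0 is selected
```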

Test plan

  • Verify that, with this fix, model outputs match the reference DeepSeek V3.2 implementation more closely
  • Run existing GLM-MOE-DSA tests to confirm no regressions

The DSA indexer was missing a ReLU activation on the per-head
dot-product scores before the weighted sum across heads. The reference
DeepSeek V3.2 implementation applies ReLU inside the fp8_index kernel
via `T.max(logits, 0)` before multiplying by head weights. Without
this, negative attention scores incorrectly contribute to the index
scoring, which can affect top-k token selection for sparse attention.

Fixes huggingface#44360

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm_moe_dsa

@Rocketknight1
Member

Running a code agent to spam our notifications with a redundant PR when there's a maintainer PR already at #44564 is a good way to get blocked - be careful!



Development

Successfully merging this pull request may close these issues.

[Bug/Discussion] The DSA indexer lacks a ReLU
