Fix CPU Attention softcap/attn_mask ordering (onnx#7867, onnx#7913) + consolidate CUDA spec coverage tests from #28371 into #28379
Conversation
… (canary)

This commit adds 4 new tests (2 Python, 2 C++) that verify ONNX Attention opset 23/24 spec ordering -- scale*QK -> softcap -> add bias/mask -> softmax -- per onnx/onnx#7867 (which superseded the now-closed onnx/onnx#7865 issue) and onnx/onnx#7913 (which swapped qk_matmul_output_mode values 1 and 2 to align with the corrected pipeline).

The tests are written using the small-softcap + poison-V technique already established by the existing CUDA-only guards at:
- test/python/transformers/test_onnx_attention/test_gqa.py:1501 (test_gqa_large_head_unfused_softcap_additive_mask_poison_fp16)
- test/python/transformers/test_onnx_attention/test_gqa.py:1761 (test_mea_gqa_softcap_mask_ordering_no_leakage_prompt_fp16)

If softcap is applied AFTER mask-add, then tanh(-inf/softcap)*softcap = -softcap (a finite value), which leaks probability through softmax to the masked position. With V=1000 placed at the masked position, the wrong order produces output ~155 (C++) / ~120 (Python) instead of the spec-correct ~0.2.

CANARY EVIDENCE (this commit, no production code change):
- C++ AttentionTest.Attention_Unfused_Softcap_NegInfMask_PoisonV_CPU FAILED -- cur_actual = 155.531, cur_expected = 0.200, delta 155.33
- C++ AttentionTest.Attention_Unfused_Softcap_NegInfMask_PoisonV_CUDA PASSED (sentinel: CUDA was already spec-correct)
- Python TestONNXAttentionCPUSoftcapMaskOrdering.test_cpu_attention_softcap_additive_mask_poison_prompt_fp32 FAILED -- max |output| above the 50 threshold
- Python TestONNXAttentionCPUSoftcapMaskOrdering.test_cpu_attention_softcap_mask_ordering_no_leakage_prompt_fp32 FAILED -- max |output| = 120.83
- Python existing CUDA guards (sentinel sanity): test_gqa_large_head_unfused_softcap_additive_mask_poison_fp16 PASSED, test_mea_gqa_softcap_mask_ordering_no_leakage_prompt_fp16 PASSED

Also refreshes the SKILL.md citations from onnx/onnx#7865 to onnx/onnx#7867 + onnx/onnx#7913 in section 1 (MEA eligibility) and section 4 (Softcap Ordering). Section 4 is rewritten to spell out the full pipeline and to reference the new CPU-side guard tests.

The CPU production fix that flips these CPU canaries from FAIL to PASS is intentionally split into the next commit, so CI publicly records the FAIL -> PASS transition and proves the new tests actually exercise the ordering bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
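To make the leakage mechanism concrete, here is a minimal numpy sketch of the ordering bug these canaries exploit. The softcap of 1.0, the logits, and the poison value are illustrative assumptions, not the actual test data:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One query, two key positions; the second position is masked with -inf and its
# V row is poisoned, mirroring the small-softcap + poison-V technique.
scale, softcap = 0.125, 1.0
qk   = np.array([2.0, 3.0])          # raw Q.K^T logits (illustrative)
mask = np.array([0.0, -np.inf])      # additive mask: second position masked out
V    = np.array([0.2, 1000.0])       # poison V at the masked position

# Spec order (onnx/onnx#7867): scale*QK -> softcap -> add mask -> softmax
y_spec = softmax(softcap * np.tanh(scale * qk / softcap) + mask) @ V

# Buggy order: mask added first, so tanh squashes -inf to the finite -softcap
y_bug = softmax(softcap * np.tanh((scale * qk + mask) / softcap)) @ V

print(y_spec)  # ~0.2  : the masked position contributes nothing
print(y_bug)   # ~224  : the finite -softcap logit leaks the poison V through softmax
```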
…microsoft#7913)

The CPU ONNX Attention op was applying mask/bias BEFORE softcap, violating the v23/24 spec post onnx/onnx#7867 + onnx/onnx#7913. With finite softcap active, -inf mask values were squashed by tanh into bounded -c, causing probability leakage onto masked positions and arithmetic mixing of poison V-values into the output (the canary tests in commit d76e45c demonstrate the leak: max |y| > 100 with poison V at a -inf-masked position). This is the inverse fix to PR microsoft#28370: CUDA was already correct per spec (it splits the softcap snapshot before the mask add); CPU is the violator.

Changes:
- ComputeAttentionProbs<T>: refactor the per-head loop. When softcap is active, run GEMM with beta=0 (raw scale*Q*K^T), apply softcap in place, then add mask explicitly via the new AddInPlace<T> helper. When softcap is disabled, preserve the original beta=1 fold path (mask preloaded into C, FMA-accumulated) so pre-existing test calibrations remain numerically identical. Snapshot offsets for kQK / kPostSoftCap / kPostMaskBias / kPostSoftMax now follow the spec-correct stage order.
- AddInPlace<T>: new helper. Uses MlasEltwiseAdd<float> for fp32; FP16 takes a scalar-fallback path because MlasEltwiseAdd<MLAS_FP16>'s dispatch->Add_Fp16 is not populated on all builds.
- attention_parameters.h: rename QKMatMulOutputMode enumerators per the onnx/onnx#7913 numbering swap. Old: kQK=0, kQKMask=1, kQKSoftCap=2, kQKSoftMax=3. New: kQK=0, kPostSoftCap=1, kPostMaskBias=2, kPostSoftMax=3.
- cuda/llm/attention.cc: update enum-tag references in comments and the qk_matmul_output_mode ENFORCE message (no logic change; CUDA was already spec-correct).
- attention_op_test.cc Attention4DWithPastAndPresentQkMatmul: regenerated expected y[] and qk_matmul[] arrays for the softcap=1.0 sub-call. Pre-fix arrays were calibrated against the OLD buggy ordering (softmax(softcap(raw+mask))) and necessarily change once the spec is honored (softmax(softcap(raw)+mask)). Five other sub-calls in the same test (modes -1/0/1/2/3, no softcap) are unchanged because the mask add trivially commutes with the no-op softcap.

With this change, the canary tests added in d76e45c transition from FAIL -> PASS on the CPU EP while the CUDA EP guards continue to pass:
- AttentionTest.Attention_Unfused_Softcap_NegInfMask_PoisonV_CPU
- AttentionTest.Attention_Unfused_Softcap_NegInfMask_PoisonV_CUDA
- TestONNXAttentionCPUSoftcapMaskOrdering.test_cpu_attention_softcap_*

Verification: 58/58 AttentionTest.* + 19/19 GQATest.* + 217/218 Python transformers/test_onnx_attention (the one failure is an unrelated pre-existing tensorscatter validation test, unmodified by this PR).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
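For reference, a hedged numpy sketch of the stage order and snapshot points that the renamed enumerators describe. This is not the ORT kernel (no GEMM beta folding, no heads, no KV cache); the function name and the mode plumbing are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pipeline_ref(Q, K, V, mask=None, scale=1.0, softcap=0.0, qk_mode=-1):
    """Spec-order reference with the post-onnx/onnx#7913 snapshot numbering."""
    s = scale * (Q @ K.T)                    # stage 0 snapshot: kQK (mode 0)
    snap = {0: s.copy()}
    if softcap > 0:
        s = softcap * np.tanh(s / softcap)   # stage 1 snapshot: kPostSoftCap (mode 1)
    snap[1] = s.copy()
    if mask is not None:
        s = s + mask                         # stage 2 snapshot: kPostMaskBias (mode 2)
    snap[2] = s.copy()
    p = softmax(s)                           # stage 3 snapshot: kPostSoftMax (mode 3)
    snap[3] = p.copy()
    y = p @ V
    return (y, snap[qk_mode]) if qk_mode >= 0 else y
```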
…lded, doc polish
Applies the consolidated findings from the PR-1-v2 review pass
(lead-39245992/pr1v2-review-{code,critical,readability}.md, pr1v2-qa.md):
- (Critical M1, blocking) Independent oracle verification of the regenerated
baseline values for `Attention4DWithPastAndPresentQkMatmul`'s softcap=1.0
sub-call. attention_ref() (test_onnx_attention/common.py) was used as a
spec-correct ground-truth oracle outside the just-modified CPU code path.
Result: max |y - oracle| = 5.4e-7, max |qk_matmul - oracle| = 6.0e-7,
3+ orders of magnitude inside the fp16 tolerance (atol=rtol=1e-3) used
elsewhere in the suite. Regen confirmed spec-correct (not circular).
Verification snippet preserved at lead-39245992/pr1v2-oracle-verify.py.
A back-reference comment was added immediately above the regenerated
arrays.
- (Critical M2) Breaking-change callout for the `qk_matmul_output_mode`
enum value swap (onnx/onnx#7913) added to lead-39245992/pr1v2-description.md.
Includes the old/new value table and notes that no in-tree consumer pins
the old numbering.
- (Code minor + Readability M2) DRY the `mask_was_folded` predicate.
The 5-clause GEMM-fold condition was previously duplicated at the if/else
that selects beta and again at the post-softcap mask-add. Captured once
into `fold_mask_into_gemm` at first use and reused; deleted the duplicate
predicate. Single source of truth for "did the mask get baked into C via
beta=1?". Comment block updated accordingly.
- (Readability M3) Reconciled AddInPlace<MLFloat16> rationale. The source
comment now states the precise reason: MlasEltwiseAdd<MLAS_FP16> requires
the per-platform EltwiseDispatch->Add_Fp16 kernel slot to be populated,
and only the ARM NEON build provides it (see
onnxruntime/core/mlas/lib/eltwise.cpp:92-103); x86 and other targets
would throw at runtime. This matches the precise wording used in the
original commit-2 message body.
Verification at HEAD:
- 58/58 AttentionTest.* PASS (incl. both PoisonV canaries +
Attention4DWithPastAndPresentQkMatmul).
- All `*GQA*` test-suite tests PASS.
- 3/3 Python canaries (CPU additive mask, CPU mask ordering, CUDA mask
ordering) PASS.
- Lintrunner clean.
Deferred per task brief:
- M4 (literal `1->2` justification at attention_op_test.cc:1782) - covered
by the new regen comment block.
- Cross-PR SKILL.md §1 coordination with PR #2 (readability M1) -
handled at merge time.
- All reviewer nits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Aligns ONNX Attention handling with the corrected softcap→mask/bias ordering and updates the auxiliary qk_matmul_output_mode naming/mapping used by the CPU/CUDA Attention codepaths and tests.
Changes:
- Refactors the CPU Attention score pipeline so softcap is applied before additive mask/bias, with updated snapshot handling for `output_qk`.
- Swaps/renames `qk_matmul_output_mode` enum values 1 and 2 to match the corrected stage ordering.
- Adds/updates Python and C++ regression coverage plus related CUDA skill/docs comments.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py | Adds CPU-side Python canaries for softcap/mask leakage behavior. |
| onnxruntime/test/providers/cpu/llm/attention_op_test.cc | Updates qk_matmul baselines and adds C++ poison-V regression tests. |
| onnxruntime/core/providers/cuda/llm/attention.cc | Renames accepted qk_matmul_output_mode enum references/messages on CUDA. |
| onnxruntime/core/providers/cpu/llm/attention.cc | Implements the CPU production fix for score ordering and output snapshot staging. |
| onnxruntime/core/providers/cpu/llm/attention_parameters.h | Renames/swaps QKMatMulOutputMode enum values to the corrected meanings. |
| .agents/skills/cuda-attention-kernel-patterns/SKILL.md | Updates CUDA Attention skill guidance to reflect the corrected ordering/history. |
@@ -471,24 +536,19 @@ void AttentionBase<T>::ComputeAttentionProbs(T* attention_probs,
                        mask_filter_value<T>());
  RunTest4D(batch_size, q_num_heads, q_sequence_length, head_size, kv_sequence_length, kv_num_heads, v_head_size, past_sequence_length,
            q, k, v, m, std::initializer_list<bool>(), past_key, past_value,
            -1, 1, std::numeric_limits<float>::quiet_NaN(), std::numeric_limits<float>::quiet_NaN(), -1, TensorType::kFloat,  // is_causal, qk_matmul_output_mode, scale, softcap, softmax_precision, tensor_type
Review — multi-model pass (Opus 4.7 lead, with GPT-5.3-Codex / GPT-5.5 / Sonnet 4.6 reviewers)
Edit (post-discussion with author). Finding #1 below was originally framed as “add opset-version gating to preserve as-shipped opset-23 semantics.” That framing was wrong: opset 23 shipped with the leakage bug in both spec text and reference impl, and onnx/onnx#7867 + #7913 are post-release errata. Preserving the as-shipped opset-23 behavior would mean knowingly keeping the leak. The author’s position — that ORT should honor the corrected ordering/numbering for both v23 and v24 ahead of the ONNX v1.22 spec text — is the right call. Finding #1 is therefore reframed as a documentation ask, not a behavior change. Finding #2 (no differentiating test) becomes more important under this stance, since the test suite is the de-facto contract pinning ORT’s chosen interpretation.
The CPU softcap-before-mask fix itself is the right call: poison-V tests are an excellent oracle, the fold_mask_into_gemm refactor is clean, and the deep-copy-before-mutate snapshot discipline avoids the obvious alias trap. Mask handling for -inf is now correct.
That said, the cross-family review surfaced two non-trivial concerns and a handful of polish items.
🔴 Major
1. Document that ORT intentionally leads the spec on qk_matmul_output_mode semantics for opset 23.
The CPU kernel registers separate v23 and v24 kernels (cpu/llm/attention.cc:25-49), and CUDA does the same (cuda/llm/attention.cc:40-69), but both versions instantiate the same Attention<T> whose constructor reads the integer with no info.node().SinceVersion() check. Per onnx/onnx#7913 the integers 1 and 2 were swapped between the as-shipped opset 23 and the corrected pipeline, so an opset-23 model carrying value 1 will resolve to kPostSoftCap instead of the as-shipped “post-mask/bias” meaning.
This is intentional and correct — opset 23 shipped with both the wrong ordering and the wrong numbering, and #7867/#7913 are post-release errata that ORT is implementing ahead of ONNX v1.22. But the comment in attention_parameters.h:14 currently just says “v23/24 use the new numbering,” which reads as a passive statement of fact and obscures that ORT is deliberately diverging from the as-shipped opset-23 spec text.
Suggested fix: expand the comment to make the intent explicit, e.g. “ORT honors the corrected #7867/#7913 ordering/numbering for both opset 23 and opset 24, ahead of the ONNX v1.22 spec text errata. Producers that emitted opset-23 nodes against the as-shipped numbering will see the swapped auxiliary output post-merge; the as-shipped behavior is the leakage bug being fixed.” That makes the policy choice deliberate rather than accidental, and gives downstream tooling a clear pointer.
2. The enum swap is not actually exercised by a differentiating test.
attention_op_test.cc:1544-1571 updates the QK-output test, but the mode-1 case runs with softcap = NaN (i.e. softcap disabled), where kPostSoftCap and kQK are observationally identical. The mode-2 case is also softcap-off, so it only verifies the no-softcap commutative case. The riskiest behavior change in this PR — that mode 1 returns softcap(scale*QK) and mode 2 returns softcap(scale*QK) + mask/bias — is not pinned by any test.
This is more important than it would be under a strict version-gated implementation: because ORT is intentionally leading the spec, the test suite is the contract that documents the chosen interpretation. Without a differentiating test, a future refactor could quietly invert the snapshots and CI would not notice.
Suggested fix: one C++ CPU test with softcap > 0, non-zero additive mask, output-3 wired, and explicit expected values for modes 1 and 2. Bonus: a third case with nonpad_kv_seqlen enabled to lock in that kPostMaskBias includes nonpad masking (currently a defensible-but-undocumented choice — see polish item below).
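To make the ask concrete, here is a sketch of how expected values for such a differentiating case could be generated. The inputs are illustrative placeholders, not proposed test data, and the real test would route them through RunTest4D / the Python harness rather than numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
scale, softcap = 0.5, 2.0                         # softcap ON, unlike the current mode-1 case
Q = rng.standard_normal((1, 4))
K = rng.standard_normal((3, 4))
mask = np.array([[0.0, -0.5, 1.5]])               # non-zero, finite additive mask

s = scale * (Q @ K.T)
mode1_expected = softcap * np.tanh(s / softcap)   # kPostSoftCap: softcap applied, NO mask
mode2_expected = mode1_expected + mask            # kPostMaskBias: mask added after softcap

# The observable consequence of the 1<->2 swap: with softcap on and a non-zero
# mask, the two snapshots differ by exactly the mask. With softcap disabled
# (today's mode-1 case), mode 1 collapses onto mode 0 and the swap is invisible.
assert not np.allclose(mode1_expected, mode2_expected)
assert np.allclose(mode2_expected - mode1_expected, mask)
```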
🟡 Minor
3. Naming inconsistency in QKMatMulOutputMode (attention_helper.h).
Three of the four values follow a kPost* "stage you exited" convention; kQK is the lone outlier and its name implies "Q times K" even though scaling has already been applied. Renaming to kPostGemm or kPreSoftCap would make the enum self-describing. (Acknowledged: the names mirror the spec text, so this may be intentional — flagging for consideration.)
4. snapshot_needs_pre_mask conflates two orthogonal conditions (attention.cc:~153). The expression gates on both out_qk != nullptr and the mode, but the name only describes the mode half. Splitting into a mode-only predicate (mode_requires_pre_mask_snapshot) and combining with the buffer-presence check at the use site would make the intent obvious. Not a correctness bug, but a trap for the next maintainer.
5. Silent perf regression on the softcap + mask path (attention.cc:504-526). With softcap enabled, the mask can no longer fold into the GEMM beta=1, so the path becomes GEMM(beta=0) → in-place softcap → separate AddInPlace over the full S×T scores per head. That is one extra read/write pass over the score matrix — measurable for long-context CPU attention. Not a regression to revert (correctness wins), but worth a benchmark and possibly a fused softcap+mask pass for follow-up.
6. New MLFloat16 AddInPlace fallback lacks focused coverage (attention.cc:111-115, 148-151). The scalar MLFloat16(scores[i].ToFloat() + addend[i].ToFloat()) path only fires off-ARM-NEON, and fp16 numerics are exactly where ordering changes are most likely to surface small parity drifts. The new Python ordering tests are fp32-only by design (per the comment at test_gqa.py:1931-1936), so the fp16 path on x86 has no targeted softcap+mask test.
🟢 Polish
7. kPostMaskBias snapshot is taken after the nonpad_kv_seqlen mask (attention.cc:540-543). Defensible — “post-mask/bias” arguably encompasses nonpad masking — but worth a one-line comment so the next reader doesn’t wonder.
8. "Pre-fix / post-fix" language in test_gqa.py class docstring and SKILL.md. Useful for the PR conversation, confusing once merged. Suggest replacing with the evergreen invariant: "any regression to wrong ordering would produce max |output| ≈ poison_value." Same for the SKILL.md parenthetical.
9. Line-number references in SKILL.md (test_gqa.py:1501, :1761) will drift on the next change to that file. Reference the test function names instead.
10. MlasEltwiseAdd<float>(addend, scores, scores, count) aliased pointers. Verified safe — MLAS loads both operands before storing, and the same pattern exists at attention_helper.h:37 and in dynamic_quantize_matmul.cc:316. Not an issue, just noting it was checked.
✅ Praise
- The `fold_mask_into_gemm` / `softcap_active` factoring with the 8-line "why" comment is the right density for a correctness-sensitive invariant.
- Poison-V tests are a much stronger oracle than "high-softcap ≈ identity" checks.
- Pipeline diagram in `SKILL.md` (stage 0→1→2→3 with the integer mapping right below) is the clearest part of the diff.
- Deep-`memcpy` snapshots before in-place mutation correctly avoid the alias/lifetime traps that this kind of staging refactor often introduces.
- Leading the spec on a clear correctness bug (rather than waiting for ONNX v1.22 to make the as-shipped opset-23 behavior officially wrong) is the right product call.
Verification done during review
- All four `qk_matmul_output_mode` × softcap × mask combinations traced by hand — flow is internally consistent.
- Repo grep + 712-model parse confirmed no in-tree serialized models carry `qk_matmul_output_mode` (so CI won’t catch a regression on the chosen interpretation — finding #2 is the safety net).
- CUDA path: confirmed rename-only plus the existing host guard rejecting modes beyond `kNone`/`kQK`; no new CUDA snapshot behavior introduced.
- Confirmed onnx/onnx#7867 and #7913 are post-release errata to opset 23 — the ordering/numbering this PR implements matches the corrected spec text, not the as-shipped opset-23 spec.
Reviewers: Opus 4.7 (lead), GPT-5.3-Codex (code), GPT-5.5 (critical), Sonnet 4.6 (readability).
tianleiwu
left a comment
Review Summary
The core fix is correct and well-motivated: the CPU EP now applies softcap before attn_mask/attn_bias, matching both the CUDA EP and the ONNX v23/v24 spec (onnx/onnx#7867 + #7913). The fold_mask_into_gemm refactoring is clean — it preserves FMA-fused numerics for the no-softcap path and only takes the explicit AddInPlace path when ordering matters.
Correctness notes (verified):
- `fold_mask_into_gemm` guards are correct: disabled when `softcap_active` or `snapshot_needs_pre_mask`.
- `nonpad_kv_seqlen` placement after softcap is correct — the overwrite with `mask_filter_value` ensures positions are effectively masked regardless of any prior softcap transformation.
- `AddInPlace<MLFloat16>` scalar fallback has no overflow risk (intermediate computation in fp32, inputs bounded by softcap range).
- Snapshot ordering matches the pipeline stages exactly.
Two test coverage gaps (endorsed from prior review threads) remain the main blocker for full confidence:
- Mode 1 (`kPostSoftCap`) is not differentiated from mode 0 (`kQK`) when softcap is active — the test uses softcap=NaN, where they are observationally identical.
- No test combines `nonpad_kv_seqlen` with `softcap > 0` to guard the reordered path.
These gaps are more important here than usual: because ORT intentionally leads the spec (applying #7913's numbering to both opset 23 and 24), the test suite is the de-facto contract for the chosen semantics.
Overall the fix is the right call. Addressing the two test gaps would let this merge with high confidence.
Inline findings (posted as body due to GitHub API inline-comment issue on this PR):
onnxruntime/core/providers/cpu/llm/attention.cc line ~469 (fold_mask_into_gemm):
Good design: cleanly separates the legacy FMA path (no numeric regression) from the spec-correct path (softcap then explicit mask add). The three conditions are exactly the cases that must avoid folding:
- `softcap_active`: mask must come after softcap per spec
- `snapshot_needs_pre_mask`: `kQK`/`kPostSoftCap` snapshots must not include the mask
Verified correct.
onnxruntime/core/providers/cpu/llm/attention.cc line ~107 (AddInPlace):
The MLAS fp16 limitation rationale is well-documented. If a future commit adds x86 fp16 MLAS support, this fallback should be gated behind a runtime check rather than being unconditional.
onnxruntime/test/providers/cpu/llm/attention_op_test.cc line ~1552:
Endorsing the earlier feedback: since this test uses softcap = NaN (disabled), mode 1 (kPostSoftCap) produces the same tensor as mode 0. A differentiating test with softcap > 0 would lock in the new semantics.
Tianlei BLOCKER #1: New mode-1+softcap differentiating test (C++ + Python). With softcap > 0 active, qk_matmul_output_mode=1 (post-onnx#7913 numbering = kPostSoftCap) snapshots softcap*tanh(scale*QK/softcap) with NO mask added. Without softcap, mode 1 aliases mode 0, so the swap is observationally indistinguishable — this test is what proves the 1<->2 swap actually changed semantics correctly.

Tianlei BLOCKER #2: New softcap+nonpad_kv_seqlen leakage test (C++ + Python). Exercises the latent fix where the nonpad sentinel is now applied AFTER softcap (per onnx#7867 ordering). Pre-fix: tanh squashed the sentinel, leaking poison V at padded positions through softmax.

Bot inline minors:
- #3 (test_gqa.py): clarify fp16 docstring — CPU does support fp16; fp32 is the natural EP-native dtype for the canary.
- #4 (attention_op_test.cc): regen comment now cites the shared opset 23/24 ordering and notes RunTest4D builds at opset 23.
- #5 (attention_parameters.h): typo defintion -> definition.
- #6 (attention.cc): replace 'guaranteed -inf' with precise wording citing the mask_filter_value<T>() = numeric_limits::lowest() / MLFloat16::MinValue sentinel and the MLAS softmax finite-input requirement (attention.h).

R-2 #1 (attention_parameters.h): Spec-leading documentation block on the QKMatMulOutputMode enum noting that ORT now uses the post-onnx#7913 numbering, while the bundled cmake/external/onnx (v1.21.0) still reflects the old numbering. ORT leads the spec change pending the next bundled-ONNX bump.

Plumbing: common.py attention_prompt_func gains an optional output_qk kwarg (default 0 / disabled). When > 0, returns a 4-tuple including the qk_matmul snapshot tensor; otherwise the unchanged 3-tuple. No existing callers are affected.

Test results:
- AttentionTest.* — 60/60 PASS (was 58, +2 new).
- TestONNXAttentionCPUSoftcapMaskOrdering — 4/4 PASS (was 2, +2 new).
- lintrunner clean across all 5 touched files.

Refs: lead-39245992/upstream-pr-status-recheck.md, pr1v2-review-{code,critical,readability,qa}.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
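A hedged numpy sketch of the case BLOCKER #2 guards, where the padded KV tail is masked by a large-negative sentinel rather than an explicit attn_mask. The values, the sentinel stand-in, and the names are illustrative, not the ORT kernel or the new test's data:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scale, softcap = 0.125, 1.0
nonpad_kv_seqlen = 2                                # last two of four KV positions are padding
qk = np.array([1.0, 2.0, 1.5, 2.5])
V  = np.array([0.1, 0.3, 1000.0, 1000.0])           # poison V at the padded positions
sentinel = np.finfo(np.float32).min                 # stand-in for mask_filter_value<T>()

# Fixed ordering: softcap first, then overwrite the padded tail with the sentinel
s = softcap * np.tanh(scale * qk / softcap)
s[nonpad_kv_seqlen:] = sentinel
y_fixed = softmax(s) @ V                            # ~0.2: padded positions contribute nothing

# Pre-fix ordering: sentinel applied before softcap, so tanh squashes it to -softcap
s_bug = scale * qk
s_bug[nonpad_kv_seqlen:] = sentinel
s_bug = softcap * np.tanh(s_bug / softcap)
y_bug = softmax(s_bug) @ V                          # poison leaks: |y_bug| >> |y_fixed|
```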
…g bundled-ONNX update
Per architect 8b9842c3's recommendation (lead-39245992/pr1v2-onnx-fixture-handling.md):
mirror the existing lines 951-960 precedent ("Skipped until cmake/external/onnx
points to onnx 1.19 ... onnx/onnx#7074") and add a skip-with-cite block
for the attention fixtures regenerated upstream by onnx/onnx#7867 and
onnx/onnx#7913.
The bundled cmake/external/onnx is v1.21.0 (predates both PRs). Our impl
emits the corrected post-spec output, which disagrees with the still-old
fixtures shipped in v1.21.0. Skip until cmake/external/onnx is bumped to
>= v1.22, at which point the entries can be removed in a single cleanup
commit (greppable via 'v1.22 (includes onnx/onnx#7867').
20 entries added (10 base + 10 _expanded):
- 4 softcap-related (cite onnx#7867)
- 14 bias / qk_matmul_output_mode-related (cite onnx#7913)
- 2 mask4d_padded_kv (cite onnx#7867 — same root cause; pre-existing
QNN-only skip at line 1498 promoted to all providers)
Why not bump cmake/external/onnx instead: ONNX v1.22 has not shipped (latest
v1.21.0 = 2026-03-27; onnx/onnx#7867 merged 2026-04-30, onnx/onnx#7913 merged 2026-05-04). A
non-tagged SHA pin would cascade into opset registrations, fusion passes,
function-body decompositions, possibly opset-25 ops, and 80+ unrelated
fixture regenerations from onnx/onnx#7867 alone — out of scope for a CPU behavioral
fix. Bump deserves its own dedicated PR.
Verification (./build/Linux/Debug/onnx_test_runner -e cpu -j 1
cmake/external/onnx/onnx/backend/test/data/node):
- Pre-patch attention failures: 11 (10 from new-spec + 1 mask4d_padded_kv)
- Post-patch attention failures: 0
- Total cases: 1588 -> 1568 (20 skipped, matching added entries)
- Only remaining failure: convinteger_with_padding (pre-existing, unrelated)
- AttentionTest.* still 60/60 PASS
- lintrunner clean
Refs: lead-39245992/pr1v2-onnx-fixture-handling.md (architect 8b9842c3)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks @tianleiwu for the explicit blockers and to @copilot-pull-request-reviewer for the inline catches. Round addresses both blockers + every minor:
- BLOCKER 1 — mode-1+softcap differentiating coverage. Without …
- BLOCKER 2 — softcap + nonpad_kv_seqlen leakage. The latent ordering bug (sentinel pre-softcap → …
- Spec-leading documentation (R-2 #1 / your spec-versioning concern). A 12-line IMPORTANT block now opens the …
- ONNX backend node tests. 20 fixtures in …
- Bot inlines all applied: typo …
- Verified locally: 60/60 …

Pushed as 2 additive commits, no force-push, so the review history stays intact: …
…runner filter

Hot-fix completes the skip-list coverage started in 0cd61f9. The previous commit added entries to onnxruntime/test/onnx/TestCase.cc::GetBrokenTests, which is consumed by the C++ onnx_test_runner binary only. The Python onnxruntime/test/python/onnx_backend_test_series.py wrapper around onnx.backend.test.runner.Runner uses a SEPARATE filter file: onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc ('current_failing_tests' array). Both runners need their own exclusion.

Adds 7 CPU-suffixed entries (immediately after the existing CUDA counterparts at lines 46-54) covering the same fixtures regenerated upstream by onnx/onnx#7867 (softcap-then-mask ordering) and onnx/onnx#7913 (qk_matmul_output_mode 1<->2 numbering):
- test_attention_3d_with_past_and_present_qk_matmul_bias_cpu
- test_attention_3d_with_past_and_present_qk_matmul_softcap_cpu
- test_attention_4d_with_past_and_present_qk_matmul_bias_3d_mask_cpu
- test_attention_4d_with_past_and_present_qk_matmul_bias_4d_mask_cpu
- test_attention_4d_with_past_and_present_qk_matmul_bias_cpu
- test_attention_4d_with_qk_matmul_bias_cpu
- test_attention_4d_with_qk_matmul_softcap_cpu

Removable in a single cleanup commit (greppable: 'pre-onnx#7867 fixture' / 'pre-onnx#7913 fixture') when cmake/external/onnx is bumped to v1.22+ — same unblock criterion as 0cd61f9.

Verification:
- JSONC parses cleanly (300 entries in current_failing_tests).
- lintrunner clean.
- 0 production code touched. Pure additive filter update.

Refs: lead-39245992/pr1v2-ci-failure-triage.md (architect 8b9842c3)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
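For orientation, the addition probably takes roughly the following shape. The test names are quoted verbatim from the commit message above; the surrounding file layout and the regex-style `^` anchors are an assumption about how onnx_backend_test_series_filters.jsonc is organised, so treat the framing as approximate:

```jsonc
{
  "current_failing_tests": [
    // ... existing entries, including the CUDA-suffixed attention counterparts ...
    // pre-onnx/onnx#7867 / pre-onnx/onnx#7913 fixtures bundled with onnx v1.21.0;
    // remove once cmake/external/onnx is bumped to >= v1.22
    "^test_attention_3d_with_past_and_present_qk_matmul_bias_cpu",
    "^test_attention_3d_with_past_and_present_qk_matmul_softcap_cpu",
    "^test_attention_4d_with_past_and_present_qk_matmul_bias_3d_mask_cpu",
    "^test_attention_4d_with_past_and_present_qk_matmul_bias_4d_mask_cpu",
    "^test_attention_4d_with_past_and_present_qk_matmul_bias_cpu",
    "^test_attention_4d_with_qk_matmul_bias_cpu",
    "^test_attention_4d_with_qk_matmul_softcap_cpu"
  ]
}
```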
…rosoft#28371

Migrate the output_qk plumbing in common.py (now `int | None`-typed with strict {0,1,2,3} validation) and three CUDA GQA tests from PR microsoft#28371 to this PR, where they semantically belong:
- The output_qk parameter numbering follows the post-onnx#7913 enum swap (kQK/kPostSoftCap/kPostMaskBias/kPostSoftMax) introduced in this PR.
- The masked fp32 softcap ordering tests pin the post-onnx#7867 'scale -> softcap -> +mask -> softmax' spec that this PR's CPU implementation enforces.

Migrated to test_gqa.py:
- TestONNXAttentionGQAOutputQK (1 test: GQA + raw QK output, fp16, unfused path)
- TestONNXAttentionGQASoftcapFloat32MaskOrdering (helper + 2 tests: symmetric and asymmetric-V-head poison-V tests on the unfused fp32 GQA path)

The unmasked fp32 GQA softcap baseline tests (TestONNXAttentionGQASoftcapFloat32) remain on microsoft#28371 — they are pure CUDA-side softcap coverage that does not depend on the spec ordering or the enum swap.

This migration also resolves the textual conflict between microsoft#28371 and microsoft#28379 in common.py and test_gqa.py, since the output_qk API can now only land once (here).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…SKILL.md cleanup)

User requested folding PR microsoft#28371 entirely into this PR to eliminate the merge-ordering hazard between the two PRs and present a coherent fix+tests+docs reviewer story. Following the earlier d613966 migration of the output_qk plumbing and masked fp32 softcap ordering tests, this commit migrates the remaining microsoft#28371 content:

test_gqa.py:
- TestONNXAttentionGQAAsymmetricHeadSize (REG, 2 tests) — guards the silent-broken-output regression on GQA + asymmetric Q/V head sizes that was fixed by PR microsoft#28358 (microsoft#28357). Pins the post-fix unfused-path behaviour on fp16 + bf16.
- TestONNXAttentionGQASoftcapFloat32 (1e baseline, 2 tests) — pins fp32 + softcap + GQA on the unfused path (MEA excludes is_gqa && fp32). Sibling to TestONNXAttentionGQASoftcapFloat32MaskOrdering already in this PR.

SKILL.md (cuda-attention-kernel-patterns):
- MEA eligibility paragraph: clarify that head_size%8 is enforced by has_memory_efficient_attention, and that head_size == v_head_size is required for GQA (LaunchUngroup) in addition to decode (LaunchConcatNewToPastKV). Reflects the post-microsoft#28358 host-side gate cleanup.

After this commit, PR microsoft#28371 is fully superseded; the lead will close it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
qk_matmul_output_mode_ == attention_helper::QKMatMulOutputMode::kPostSoftCap ||
qk_matmul_output_mode_ == attention_helper::QKMatMulOutputMode::kPostMaskBias ||
qk_matmul_output_mode_ == attention_helper::QKMatMulOutputMode::kPostSoftMax,
"qk_matmul_output_mode must be 0, 1, 2, or 3.");
// these reference values are stable across both. Independent oracle
// verification: see lead-39245992/pr1v2-oracle-verify.py
// (max |y - oracle| ~5e-7, max |qk_matmul - oracle| ~6e-7, well inside
// fp16 tolerance). Pre-fix values were calibrated to the buggy
// bias-before-softcap ordering.
Review (4-agent fan-out: code-reviewer + critical-reviewer + readability-reviewer + me)
🔴 Critical
None.
🟡 Major (worth addressing before merge)
🔵 Minor
↪️ Follow-up (out of scope, surfacing for tracking only)
✅ What's right
Recommendation
Approve once Major #1 and Major #2 are addressed. Major #3 and #4 are wording/test-hygiene improvements. Minors are take-or-leave.
Fix CPU Attention softcap/attn_mask ordering to match ONNX v24 spec (onnx#7867, onnx#7913) + consolidate CUDA spec coverage tests (#28371)
Branch: `fix-cpu-attention-softcap-mask-ordering` (off `microsoft/onnxruntime` `main`). Tip: `bfe33d3d08`

Scope
- CPU spec-correct softcap+mask ordering fix (per "Fix Attention op softcap ordering: apply before mask/bias" onnx/onnx#7867) — the headline bug. The CPU `attention.cc` was applying `attn_mask` before `softcap` (which would clip a `-inf` masked logit to a finite value, leaking the masked V slot through softmax). Now applies `scale → softcap → +attn_mask → softmax` in spec order.
- `qk_matmul_output_mode` enum value swap (per "Swap qk_matmul_output_mode 1 & 2 to match computation order" onnx/onnx#7913). Breaking change to mode `1 ↔ 2` semantics + matching C++ enumerator rename. CPU op fully implements the new ordering; CUDA op (Flash/MEA/Unfused dispatch) confirmed unchanged in behaviour.
- Tianlei blocker tests (CPU): mode-1+softcap differentiating, softcap+nonpad_kv_seqlen no-leakage. Both failed before this PR's fix.
- Comprehensive CUDA Attention spec coverage (consolidated from "Add CUDA Attention spec-coverage tests (GQA asymmetric head-size, output_qk, fp32 softcap+mask ordering)" #28371):
  - `TestONNXAttentionGQAAsymmetricHeadSize` (REG) — guards the silent-broken-output regression on GQA + asymmetric Q/V head sizes ("CUDA Attention kernel crashes with mismatched Q/K head dimensions (head_size != v_head_size)" #28357, fixed by "Fix CUDA Attention dispatch: skip MEA when head_size != v_head_size in GQA" #28358).
  - `TestONNXAttentionGQAOutputQK` (1c) — pins GQA + raw-QK output (`qk_matmul_output_mode == kQK`) on the unfused fp16 path.
  - `TestONNXAttentionGQASoftcapFloat32` (1e baseline) + `TestONNXAttentionGQASoftcapFloat32MaskOrdering` (1e ordering) — pins fp32 + softcap on the unfused GQA path, including the post-onnx/onnx#7867 softcap+mask ordering, using a poison-V leakage detector.
- SKILL.md updates (`.agents/skills/cuda-attention-kernel-patterns/SKILL.md`): `head_size % 8` is enforced by `has_memory_efficient_attention`; `head_size == v_head_size` is required for GQA (LaunchUngroup) in addition to decode (LaunchConcatNewToPastKV). Reflects the post-#28358 ("Fix CUDA Attention dispatch: skip MEA when head_size != v_head_size in GQA") host-side gate.
- ONNX node-test skip lists (transient, until ONNX v1.22+ is bundled):
  - `onnxruntime/test/onnx/TestCase.cc::GetBrokenTests()` — 20 attention entries.
  - `onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc::current_failing_tests` — 7 CPU attention entries.

  Both block-comments cite "Fix Attention op softcap ordering: apply before mask/bias" onnx/onnx#7867 + "Swap qk_matmul_output_mode 1 & 2 to match computation order" onnx/onnx#7913 and the v1.22+ unblock criterion: the fixtures bundled in v1.21.0 were generated before the spec PRs landed and therefore encode the now-incorrect ordering. Once `cmake/external/onnx` is bumped to ≥1.22, both skip blocks should be removed in one commit.

Cross-references

Verification

- `AttentionTest.*` — 60/60 pass (gtest, CPU + CUDA).
- `TestONNXAttentionCPUSoftcapMaskOrdering` — 4/4 pass (CPU spec fix + Tianlei blockers).
- `TestONNXAttentionGQAAsymmetricHeadSize` — 2/2 pass (fp16 + bf16, H100).
- `TestONNXAttentionGQAOutputQK` — 1/1 pass (fp16, H100).
- `TestONNXAttentionGQASoftcapFloat32` + `…MaskOrdering` — 4/4 pass (fp32, H100).
- `lintrunner` clean on all touched files.