Fix CB Accuracy Regression under FA2 #45274

Closed
Qubitium wants to merge 11 commits into huggingface:main from Qubitium:fix-fa2-max-bt

Conversation

@Qubitium
Contributor

@Qubitium Qubitium commented Apr 7, 2026

What does this PR do?

  1. Fix: CUDA graph reuse for FA2 continuous batching was wrongly keyed, causing an accuracy collapse under specific configurations

CUDA graph reuse used the wrong key: replay reuse depended on padded tensor sizes, but FA varlen kernels also depend on non-tensor runtime ints such as max_seqlen_q and max_seqlen_k. That allowed CB to replay a graph captured for one FA runtime shape against a different one, which is why max_batch_tokens could change accuracy.
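A minimal sketch of the keying problem (function and key names are hypothetical, not the actual transformers internals): if the replay cache is keyed only on padded tensor shapes, two calls that differ in `max_seqlen_q`/`max_seqlen_k` collide onto the same captured graph.

```python
# Hypothetical sketch: graph replay keyed only on padded tensor shapes
# vs. keyed on shapes plus the non-tensor runtime ints that FA varlen
# kernels also depend on. Names are illustrative, not transformers code.

def bad_key(q_shape, k_shape):
    # Misses max_seqlen_q / max_seqlen_k: two different FA runtime
    # shapes produce the same key and replay the wrong graph.
    return (q_shape, k_shape)

def good_key(q_shape, k_shape, max_seqlen_q, max_seqlen_k):
    # Include every non-tensor argument the graph was captured with.
    return (q_shape, k_shape, max_seqlen_q, max_seqlen_k)

# Same padded shapes, different runtime lengths:
a = bad_key((384, 64), (384, 64))
b = bad_key((384, 64), (384, 64))
assert a == b  # collision: second call would replay a stale graph

c = good_key((384, 64), (384, 64), max_seqlen_q=128, max_seqlen_k=256)
d = good_key((384, 64), (384, 64), max_seqlen_q=96, max_seqlen_k=384)
assert c != d  # distinct keys: no cross-shape replay
```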

  2. Fix: remove the FA2 decode-fast-path gate, which didn't make sense (likely related to the bug above)

_ensure_decode_fast_path_is_available() only accepted FA3, but FA2/FA3 should both be supported.

Looking at the comments that introduced the FA3-only gate, I strongly suspect the first bug is the root cause of the output variance that led to FA2 being gated off.

  MAIN = /tmp/transformers-main-head @ d3c7a19176
  PR = /root/transformers @ 33e9cbb08e
  Workload: Evaluation on GSM8K Platinum, paged|flash_attention_2, bf16, batch_size=24, max_rows=128, use_cuda_graph=auto, use_async_batching=auto.
  

  Sweep Results
    +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+
    | max_bt | MAIN acc,num | PR acc,num | Δ acc   | MAIN samp/s | PR samp/s | Δ samp/s | MAIN peak GB | PR peak GB |
    +========+==============+============+=========+=============+===========+==========+==============+============+
    | 128    | 0.4140625    | 0.4140625  | +0.0000 | 9.7604      | 10.1033   | +3.51%   | 34.7090      | 34.7090    |
    | 256    | 0.4140625    | 0.4453125  | +0.0312 | 9.7349      | 8.7137    | -10.49%  | 34.9004      | 34.9004    |
--> | 384    | 0.2265625    | 0.4531250  | +0.2266 | 10.0055     | 9.9615    | -0.44%   | 35.0859      | 35.0898    |
    | 512    | 0.4218750    | 0.4062500  | -0.0156 | 10.0384     | 9.7630    | -2.74%   | 35.2949      | 35.2949    |
    | 640    | 0.3984375    | 0.4140625  | +0.0156 | 10.2056     | 9.5572    | -6.35%   | 35.4883      | 35.4902    |
    | 768    | 0.3984375    | 0.4062500  | +0.0078 | 10.0023     | 9.7447    | -2.58%   | 35.6836      | 35.6836    |
    | 896    | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
    | 1024   | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
    +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+

  Failure Details
    +--------+--------------------------------+--------------------------------+
    | max_bt | MAIN                           | PR                             |
    +========+================================+================================+
    | 896    | illegal memory access          | illegal memory access          |
    | 1024   | CUBLAS_STATUS_EXECUTION_FAILED | illegal memory access          |
    +--------+--------------------------------+--------------------------------+

Look at the accuracy collapse at max_bt (max batched tokens for continuous batching) == 384!

The 896/1024 failures are caused by another, unrelated bug whose fix I did not push here, so this PR stays clean and isolated.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

@Qubitium Qubitium changed the title from "Fix fa2 max bt" to "Fix CB Accuracy Regression under FA2" on Apr 7, 2026
@remi-or
Collaborator

remi-or commented Apr 8, 2026

Hi @Qubitium! Thanks for reporting this issue, which I totally overlooked. I was wondering if the fix shouldn't be to just pad max_seqlen_q to the number of q tokens, and add max_seqlen_k to the signature with a padding interval. Seems a bit simpler than the alternative. Also not sure about re-instating the FA2 decode path -- I don't see how this fixes the issue I had when calling with_kvcache, since that call does not use max_seqlen. Wdyt?
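The padding-interval idea can be sketched as follows (the interval size and helper name are assumptions for illustration, not the eventual implementation): rounding max_seqlen up to a fixed bucket bounds the number of distinct graph keys, and padding up is safe because max_seqlen only needs to be an upper bound on the true sequence lengths.

```python
def pad_to_interval(max_seqlen: int, interval: int = 64) -> int:
    # Round up to the next multiple of `interval`, so graphs are keyed
    # on a small set of bucketed lengths instead of every exact runtime
    # value. Over-estimating max_seqlen is valid for FA varlen kernels.
    return -(-max_seqlen // interval) * interval

print(pad_to_interval(1))    # 64
print(pad_to_interval(64))   # 64
print(pad_to_interval(65))   # 128
```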

@Qubitium
Contributor Author

Qubitium commented Apr 9, 2026

> Hi @Qubitium! Thanks for reporting this issue, which I totally overlooked. I was wondering if the fix shouldn't be to just pad max_seqlen_q to the number of q tokens, and add max_seqlen_k to the signature with a padding interval. Seems a bit simpler than the alternative. Also not sure about re-instating the FA2 decode path -- I don't see how this fixes the issue I had when calling with_kvcache, since that call does not use max_seqlen. Wdyt?

@remi-or I just checked your alternate PR and it is better in every way. Feel free to close this PR.

As far as the decode-fast-path for FA2 I didn't have visibility to why it was gated so I assumed the variance was caused by this. Can you double check again why FA2 kv cache has issues? Without the decode fast path, in my benchmarks the decode phase is 3x slower which is substantial. My local evals show that with it enabled, full model evals are not degrading in quality (A100), though the evals may not be hitting the kv cache much so not triggering the bug/variance that you noticed. It would be awesome if we can get FA2 enabled end to end since I don't have a FA3 compatible gpu.

@remi-or
Collaborator

remi-or commented Apr 27, 2026

Closing, any further discussion on the topic of FA2 can happen on #45323 :)
