Fix CB Accuracy Regression under FA2 #45274

Closed
Qubitium wants to merge 11 commits into huggingface:main from Qubitium:fix-fa2-max-bt

Conversation

@Qubitium
Contributor

@Qubitium Qubitium commented Apr 7, 2026

What does this PR do?

  1. Fix: CUDA graph reuse for FA2 continuous batching was wrongly keyed, causing an accuracy collapse under specific configurations

CUDA graph reuse used the wrong key: replay reuse depended on padded tensor sizes, but FA varlen kernels also depend on non-tensor runtime ints such as max_seqlen_q and max_seqlen_k. That allowed CB to replay a graph captured for one FA runtime shape against a different one, which is why max_batch_tokens could change accuracy.
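A minimal sketch of the keying problem (function and key names are hypothetical, not the actual transformers internals): if the replay cache is keyed only on padded tensor shapes, two calls that differ in `max_seqlen_q`/`max_seqlen_k` collide onto the same captured graph.

```python
# Hypothetical sketch: graph replay keyed only on padded tensor shapes
# vs. keyed on shapes plus the non-tensor runtime ints that FA varlen
# kernels also depend on. Names are illustrative, not transformers code.

def bad_key(q_shape, k_shape):
    # Misses max_seqlen_q / max_seqlen_k: two different FA runtime
    # shapes produce the same key and replay the wrong graph.
    return (q_shape, k_shape)

def good_key(q_shape, k_shape, max_seqlen_q, max_seqlen_k):
    # Include every non-tensor argument the graph was captured with.
    return (q_shape, k_shape, max_seqlen_q, max_seqlen_k)

# Same padded shapes, different runtime lengths:
a = bad_key((384, 64), (384, 64))
b = bad_key((384, 64), (384, 64))
assert a == b  # collision: second call would replay a stale graph

c = good_key((384, 64), (384, 64), max_seqlen_q=128, max_seqlen_k=256)
d = good_key((384, 64), (384, 64), max_seqlen_q=96, max_seqlen_k=384)
assert c != d  # distinct keys: no cross-shape replay
```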

  2. Fix: remove the FA2 decode-fast-path gate, which didn't make sense (likely related to the bug above)

_ensure_decode_fast_path_is_available() only accepted FA3, but FA2/FA3 should both be supported.

Looking at the comments that introduced the FA3-only gate, I strongly suspect the first bug is the root cause of the output variance that led to FA2 being gated off.

  MAIN = /tmp/transformers-main-head @ d3c7a19176
  PR = /root/transformers @ 33e9cbb08e
  Workload: Evaluation on GSM8K Platinum, paged|flash_attention_2, bf16, batch_size=24, max_rows=128, use_cuda_graph=auto, use_async_batching=auto.
  

  Sweep Results
    +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+
    | max_bt | MAIN acc,num | PR acc,num | Δ acc   | MAIN samp/s | PR samp/s | Δ samp/s | MAIN peak GB | PR peak GB |
    +========+==============+============+=========+=============+===========+==========+==============+============+
    | 128    | 0.4140625    | 0.4140625  | +0.0000 | 9.7604      | 10.1033   | +3.51%   | 34.7090      | 34.7090    |
    | 256    | 0.4140625    | 0.4453125  | +0.0312 | 9.7349      | 8.7137    | -10.49%  | 34.9004      | 34.9004    |
--> | 384    | 0.2265625    | 0.4531250  | +0.2266 | 10.0055     | 9.9615    | -0.44%   | 35.0859      | 35.0898    |
    | 512    | 0.4218750    | 0.4062500  | -0.0156 | 10.0384     | 9.7630    | -2.74%   | 35.2949      | 35.2949    |
    | 640    | 0.3984375    | 0.4140625  | +0.0156 | 10.2056     | 9.5572    | -6.35%   | 35.4883      | 35.4902    |
    | 768    | 0.3984375    | 0.4062500  | +0.0078 | 10.0023     | 9.7447    | -2.58%   | 35.6836      | 35.6836    |
    | 896    | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
    | 1024   | FAIL         | FAIL       | n/a     | n/a         | n/a       | n/a      | n/a          | n/a        |
    +--------+--------------+------------+---------+-------------+-----------+----------+--------------+------------+

  Failure Details
    +--------+--------------------------------+--------------------------------+
    | max_bt | MAIN                           | PR                             |
    +========+================================+================================+
    | 896    | illegal memory access          | illegal memory access          |
    | 1024   | CUBLAS_STATUS_EXECUTION_FAILED | illegal memory access          |
    +--------+--------------------------------+--------------------------------+

Look at the accuracy collapse at max_bt (max batched tokens for continuous batching) == 384!

The 896/1024 failures are caused by another, unrelated bug whose fix I did not push here, so this PR stays clean and isolated.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

@Qubitium Qubitium changed the title from "Fix fa2 max bt" to "Fix CB Accuracy Regression under FA2" on Apr 7, 2026
@remi-or
Collaborator

remi-or commented Apr 8, 2026

Hi @Qubitium! Thanks for reporting this issue, which I totally overlooked. I was wondering if the fix shouldn't be to just pad max_seqlen_q to the number of q tokens, and add max_seqlen_k to the signature with a padding interval. Seems a bit simpler than the alternative. Also not sure about re-instating the FA2 decode path -- I don't see how this fixes the issue I had when calling with_kvcache, since that call does not use max_seqlen. Wdyt?
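The padding-interval idea can be sketched as follows (the interval size and helper name are assumptions for illustration, not the eventual implementation): rounding max_seqlen up to a fixed bucket bounds the number of distinct graph keys, and padding up is safe because max_seqlen only needs to be an upper bound on the true sequence lengths.

```python
def pad_to_interval(max_seqlen: int, interval: int = 64) -> int:
    # Round up to the next multiple of `interval`, so graphs are keyed
    # on a small set of bucketed lengths instead of every exact runtime
    # value. Over-estimating max_seqlen is valid for FA varlen kernels.
    return -(-max_seqlen // interval) * interval

print(pad_to_interval(1))    # 64
print(pad_to_interval(64))   # 64
print(pad_to_interval(65))   # 128
```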

@Qubitium
Contributor Author

Qubitium commented Apr 9, 2026

> Hi @Qubitium! Thanks for reporting this issue, which I totally overlooked. I was wondering if the fix shouldn't be to just pad max_seqlen_q to the number of q tokens, and add max_seqlen_k to the signature with a padding interval. Seems a bit simpler than the alternative. Also not sure about re-instating the FA2 decode path -- I don't see how this fixes the issue I had when calling with_kvcache, since that call does not use max_seqlen. Wdyt?

@remi-or I just checked your alternate PR and it is better in every way. Feel free to close this PR.

As far as the decode-fast-path for FA2 I didn't have visibility to why it was gated so I assumed the variance was caused by this. Can you double check again why FA2 kv cache has issues? Without the decode fast path, in my benchmarks the decode phase is 3x slower which is substantial. My local evals show that with it enabled, full model evals are not degrading in quality (A100), though the evals may not be hitting the kv cache much so not triggering the bug/variance that you noticed. It would be awesome if we can get FA2 enabled end to end since I don't have a FA3 compatible gpu.

@remi-or
Collaborator

remi-or commented Apr 27, 2026

Closing, any further discussion on the topic of FA2 can happen on #45323 :)
