
perf: Remove implicit CPU-GPU syncs due to implicit .item() call #42433

Merged
ArthurZucker merged 4 commits into huggingface:main from Dhruv88:avoid_gpu_cpu_sync on Dec 1, 2025

Conversation

@Dhruv88 (Contributor) commented Nov 26, 2025

What does this PR do?

Remove the unnecessary cudaStreamSynchronize caused by an implicit call to .item(). The motivation is described in the issue tagged below.

Fixes #42422

I made the changes as proposed in the issue.
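
For context, here is a minimal sketch of the kind of implicit sync being removed. This is illustrative only, not the exact diff in this PR, and the variable names (past_seen_tokens, seq_len) are made up for the example: passing a 0-dim CUDA tensor where torch.arange expects a Python number forces a device-to-host .item() conversion, which stalls the host until the GPU stream catches up.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len = 4

# A 0-dim tensor living on the GPU, e.g. a "tokens seen so far" counter.
past_seen_tokens_gpu = torch.tensor(8, device=device)

# Implicit sync: torch.arange needs Python numbers for start/end, so the 0-dim
# CUDA tensor is converted via .item() (aten._local_scalar_dense), blocking the
# host until the GPU stream finishes.
cache_position_syncing = torch.arange(
    past_seen_tokens_gpu, past_seen_tokens_gpu + seq_len, device=device
)

# No sync: keep the counter as a plain Python int (or a CPU-side value), so the
# range can be built without a device-to-host copy.
past_seen_tokens = 8
cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_len, device=device)
```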

Heads up

When I ran the integration tests, two of them were failing:
test_llama_3_1_hard
test_model_7b_logits_bf16

The issue was that the actual output did not match the expected output. I am using an A100 GPU.
In the first case, the actual output matched the expected ROCm output rather than the CUDA one.
In the second case it was:
tensor([[-6.5081, -4.1175, -4.9761, -3.1678, 0.8199, -3.0029, 1.2809, -3.3309]]), which again differs slightly from the expected CUDA outputs.

However, these tests fail even without my change, and the output is identical with and without it, so my belief is that my changes are not causing the failures.

Another point: similar code appears across multiple files, so if this change is as expected I can make the same change in the other files as well. Please let me know how I should proceed.

@Rocketknight1 @ArthurZucker @Cyrilvallez

@Rocketknight1 (Member) commented:

Yes, those CI errors are not your fault! However, the check_repository_failure issue means that your change might need to be propagated to other files. Run make fix-copies and the other models that copy the llama code will also be updated.

@Dhruv88 (Contributor, Author) commented Nov 28, 2025

@Rocketknight1 Among the tests that failed, two of them look to be due to a worker timeout or crash, so rerunning should work. The third one is for modeling_colqwen2.py. However, no code in this file is changed by my PR, so it is unrelated to my changes. The failing test is test_load_save_without_tied_weights, which comes from test_modeling_common.py.

Anyway, after some investigation, here is my understanding and a possible fix for the issue.

The test sets config.tie_word_embeddings=False. However, there is no such key in the config. What should actually be set is config.vlm_config.tie_word_embeddings=False. Doing this instead makes the test pass, since it correctly unties the weights in model.vlm.

Please let me know if my understanding is right. If yes, then I can try to make a quick fix by ensuring the correct key is set.
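
As a rough sketch of that proposed fix (an assumption-laden illustration, not a verified test diff; it presumes ColQwen2Config can be built with defaults and exposes a vlm_config sub-config as described above):

```python
from transformers import ColQwen2Config

config = ColQwen2Config()

# Setting the flag on the top-level config is effectively a no-op for this
# model, because that key is never consumed there:
# config.tie_word_embeddings = False

# Setting it on the sub-config is what actually unties the embeddings used by
# model.vlm, which is what the test expects:
config.vlm_config.tie_word_embeddings = False
```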

@ArthurZucker (Collaborator) left a review comment:

Very nice! Can you make sure compile with reduce-overhead still works as expected? That's my only concern, otherwise LGTM!

@Rocketknight1 (Member) commented:

The Colqwen2 issue should be fixed on main too, so a rebase should make it go away.

@Dhruv88 (Contributor, Author) commented Nov 29, 2025

> Very nice! Can you make sure compile with reduce-overhead still works as expected? That's my only concern, otherwise LGTM!

@ArthurZucker I tried the minimal code below, taken from the official documentation here.
I made minor changes so that cache_position is not passed and the path with the changed code is taken.
I put the snippet in this gist for reference.

Note that cache_position is not passed to the compiled_model forward pass or to the compiled decode_one_tokens.
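
For reference, a rough sketch of that kind of harness (this is not the actual gist; the checkpoint name, dtype, and decode helper below are assumptions modeled on the documentation example described above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any causal LM that supports a static cache works here.
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

def decode_one_tokens(model, cur_token, cache_position, past_key_values):
    # cache_position is passed as None so the model computes it internally from
    # the cache length, exercising the code path this PR changes.
    logits = model(
        cur_token,
        cache_position=cache_position,
        past_key_values=past_key_values,
        return_dict=False,
        use_cache=True,
    )[0]
    return torch.argmax(logits[:, -1], dim=-1)[:, None]

# "reduce-overhead" enables CUDA graphs, which is where hidden host/device
# syncs are most costly; the prefill and decode loop follow the documentation
# example and are omitted here.
decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
```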

With the updated code it runs and generates the output below:

['Simply put, the theory of relativity states that 1) the laws of physics are the same for all observers in uniform motion relative to one another and 2) the speed of light is always constant, regardless of the motion of the observer or the!', 'My favorite all time favorite condiment is ketchup. I love it on burgers, fries, scrambled eggs, and even as a dip for chicken tenders. I have tried many different brands of ketchup over the years, but my go-to is always!']

However, with the earlier code, torch.compile fails with a graph break:

Traceback (most recent call last):
  File "/home/dd/transformers/test.py", line 53, in <module>
    next_token = decode_one_tokens(model, next_token.clone(), None, past_key_values)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dd/miniconda3/envs/transformers/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 841, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Data dependent operator
  Explanation: Operator `aten._local_scalar_dense.default` has a non-Tensor output whose value is dependent on the data of Tensor inputs.
  Hint: Enable tracing of data-dependent output operators with `torch._dynamo.config.capture_scalar_outputs = True`

  Developer debug context: aten._local_scalar_dense.default

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0033.html

from user code:
   File "/home/dd/transformers/test.py", line 20, in decode_one_tokens
    logits = model(
  File "/home/dd/transformers/src/transformers/utils/generic.py", line 764, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/dd/transformers/src/transformers/models/llama/modeling_llama.py", line 489, in forward
    outputs: BaseModelOutputWithPast = self.model(
  File "/home/dd/transformers/src/transformers/utils/generic.py", line 919, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/dd/transformers/src/transformers/models/llama/modeling_llama.py", line 404, in forward
    cache_position: torch.Tensor = torch.arange(

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
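
For what it's worth, the Dynamo hint above points at the generic escape hatch, whereas this PR removes the data-dependent scalar altogether; a minimal sketch of the distinction (illustrative only):

```python
import torch

# Dynamo's suggested workaround from the hint above: trace data-dependent
# scalar outputs symbolically instead of breaking the graph.
torch._dynamo.config.capture_scalar_outputs = True

# The PR takes the other route: the implicit .item() on a CUDA tensor is
# removed, so neither the graph break nor the host/device sync happens.
```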

To ensure that this code path is taken, I also put a breakpoint before that line in the uncompiled version and ran it in debugging mode. I can confirm that it indeed executes that line.

Let me know if you want me to check for anything else!

@ArthurZucker (Collaborator) commented:

Sounds great!

@ArthurZucker marked this pull request as ready for review on November 29, 2025 at 16:37
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Dhruv88 force-pushed the avoid_gpu_cpu_sync branch 2 times, most recently from b770b2f to 7113d1c on December 1, 2025 at 09:08
@Dhruv88 (Contributor, Author) commented Dec 1, 2025

@ArthurZucker @Rocketknight1 I rebased a couple of times, but test_torch and test_tokenization always seem to fail due to a worker crash, and test_processor due to a timeout. Maybe just rerunning the failed tests would help, but I'm not sure how to do that. How can I understand the issue here?

@ArthurZucker (Collaborator) commented Dec 1, 2025

Don't worry, these are unrelated!

@Dhruv88 force-pushed the avoid_gpu_cpu_sync branch from 7113d1c to 780a281 on December 1, 2025 at 12:27
@github-actions (Bot) commented Dec 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, arcee, aria, bitnet, cohere, csm, cwm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium

@ArthurZucker merged commit a3881a8 into huggingface:main on Dec 1, 2025 (16 of 21 checks passed)
@ArthurZucker (Collaborator) commented:

Thanks for the PR!

sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025
…gingface#42433)

* perf: Remove implicit CPU-GPU syncs due to implicit .item() call

* fix: replicated the changes across similar files

* fix: update the newly added nanochat model files

* fix: use input_shape and device instead of input_emdeds properties for imagegpt
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
…gingface#42433)

* perf: Remove implicit CPU-GPU syncs due to implicit .item() call

* fix: replicated the changes across similar files

* fix: update the newly added nanochat model files

* fix: use input_shape and device instead of input_emdeds properties for imagegpt
Linked issue: Performance Improvement: Avoid host↔device synchronizations caused by tensor-to-Python conversions and certain tensor ops
