
perf: Remove implicit CPU-GPU syncs due to implicit .item() call #42433

Merged
ArthurZucker merged 4 commits into huggingface:main from Dhruv88:avoid_gpu_cpu_sync on Dec 1, 2025

Conversation

@Dhruv88 (Contributor) commented Nov 26, 2025

What does this PR do?

Remove the unnecessary cudaStreamSynchronize caused by an implicit call to .item(). The motivation is described in the issue tagged below.

Fixes #42422

I made the changes as proposed in the issue.
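
For context, here is a minimal sketch of the kind of implicit sync being removed. This is illustrative only, not the exact diff in this PR, and the variable names (past_seen_tokens, seq_len) are made up for the example: passing a 0-dim CUDA tensor where torch.arange expects a Python number forces a device-to-host .item() conversion, which stalls the host until the GPU stream catches up.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len = 4

# A 0-dim tensor living on the GPU, e.g. a "tokens seen so far" counter.
past_seen_tokens_gpu = torch.tensor(8, device=device)

# Implicit sync: torch.arange needs Python numbers for start/end, so the 0-dim
# CUDA tensor is converted via .item() (aten._local_scalar_dense), blocking the
# host until the GPU stream finishes.
cache_position_syncing = torch.arange(
    past_seen_tokens_gpu, past_seen_tokens_gpu + seq_len, device=device
)

# No sync: keep the counter as a plain Python int (or a CPU-side value), so the
# range can be built without a device-to-host copy.
past_seen_tokens = 8
cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_len, device=device)
```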

Heads up

When I ran the integration tests, two of them were failing:
test_llama_3_1_hard
test_model_7b_logits_bf16

The issue was that the actual output did not match the expected output. I am using an A100 GPU.
In the first case, the actual output matched the expected ROCm output rather than the CUDA one.
In the second case it was:
tensor([[-6.5081, -4.1175, -4.9761, -3.1678, 0.8199, -3.0029, 1.2809, -3.3309]]), which again differs slightly from the expected CUDA outputs.

However, these tests fail even without my change, and the output is identical with and without it, so my belief is that my changes are not causing the failures.

Another point: similar code appears across multiple files, so if this change is as expected I can make the same change in the other files as well. Please let me know how I should proceed.

@Rocketknight1 @ArthurZucker @Cyrilvallez

@Rocketknight1 (Member) commented:

Yes, those CI errors are not your fault! However, the check_repository_failure issue means that your change might need to be propagated to other files. Run make fix-copies and the other models that copy the llama code will also be updated.

@Dhruv88 (Contributor, Author) commented Nov 28, 2025

@Rocketknight1 Among the tests that failed, two of them look to be due to a worker timeout or crash, so rerunning should work. The third one is for modeling_colqwen2.py. However, no code in this file is changed by my PR, so it is unrelated to my changes. The failing test is test_load_save_without_tied_weights, which comes from test_modeling_common.py.

Anyway, after some investigation, here is my understanding and a possible fix for the issue.

The test sets config.tie_word_embeddings=False. However, there is no such key in the config. What should actually be set is config.vlm_config.tie_word_embeddings=False. Doing this instead makes the test pass, since it correctly unties the weights in model.vlm.

Please let me know if my understanding is right. If yes, then I can try to make a quick fix by ensuring the correct key is set.
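
As a rough sketch of that proposed fix (an assumption-laden illustration, not a verified test diff; it presumes ColQwen2Config can be built with defaults and exposes a vlm_config sub-config as described above):

```python
from transformers import ColQwen2Config

config = ColQwen2Config()

# Setting the flag on the top-level config is effectively a no-op for this
# model, because that key is never consumed there:
# config.tie_word_embeddings = False

# Setting it on the sub-config is what actually unties the embeddings used by
# model.vlm, which is what the test expects:
config.vlm_config.tie_word_embeddings = False
```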

@ArthurZucker (Collaborator) left a review comment:

Very nice! Can you make sure compile with reduce-overhead still works as expected? That's my only concern, otherwise LGTM!

@Rocketknight1 (Member) commented:

The Colqwen2 issue should be fixed on main too, so a rebase should make it go away.

@Dhruv88 (Contributor, Author) commented Nov 29, 2025

> Very nice! Can you make sure compile with reduce-overhead still works as expected? That's my only concern, otherwise LGTM!

@ArthurZucker I tried the minimal code below, taken from the official documentation here.
I made minor changes so that cache_position is not passed and the path with the changed code is taken.
I put the snippet in this gist for reference.

Note that cache_position is not passed to the compiled_model forward pass or to the compiled decode_one_tokens.
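
For reference, a rough sketch of that kind of harness (this is not the actual gist; the checkpoint name, dtype, and decode helper below are assumptions modeled on the documentation example described above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any causal LM that supports a static cache works here.
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

def decode_one_tokens(model, cur_token, cache_position, past_key_values):
    # cache_position is passed as None so the model computes it internally from
    # the cache length, exercising the code path this PR changes.
    logits = model(
        cur_token,
        cache_position=cache_position,
        past_key_values=past_key_values,
        return_dict=False,
        use_cache=True,
    )[0]
    return torch.argmax(logits[:, -1], dim=-1)[:, None]

# "reduce-overhead" enables CUDA graphs, which is where hidden host/device
# syncs are most costly; the prefill and decode loop follow the documentation
# example and are omitted here.
decode_one_tokens = torch.compile(decode_one_tokens, mode="reduce-overhead", fullgraph=True)
```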

With the updated code it runs and generates the output below:

['Simply put, the theory of relativity states that 1) the laws of physics are the same for all observers in uniform motion relative to one another and 2) the speed of light is always constant, regardless of the motion of the observer or the!', 'My favorite all time favorite condiment is ketchup. I love it on burgers, fries, scrambled eggs, and even as a dip for chicken tenders. I have tried many different brands of ketchup over the years, but my go-to is always!']

However, with the earlier code, torch.compile fails with a graph break:

Traceback (most recent call last):
  File "/home/dd/transformers/test.py", line 53, in <module>
    next_token = decode_one_tokens(model, next_token.clone(), None, past_key_values)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dd/miniconda3/envs/transformers/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 841, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Data dependent operator
  Explanation: Operator `aten._local_scalar_dense.default` has a non-Tensor output whose value is dependent on the data of Tensor inputs.
  Hint: Enable tracing of data-dependent output operators with `torch._dynamo.config.capture_scalar_outputs = True`

  Developer debug context: aten._local_scalar_dense.default

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0033.html

from user code:
   File "/home/dd/transformers/test.py", line 20, in decode_one_tokens
    logits = model(
  File "/home/dd/transformers/src/transformers/utils/generic.py", line 764, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/dd/transformers/src/transformers/models/llama/modeling_llama.py", line 489, in forward
    outputs: BaseModelOutputWithPast = self.model(
  File "/home/dd/transformers/src/transformers/utils/generic.py", line 919, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/dd/transformers/src/transformers/models/llama/modeling_llama.py", line 404, in forward
    cache_position: torch.Tensor = torch.arange(

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
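
For what it's worth, the Dynamo hint above points at the generic escape hatch, whereas this PR removes the data-dependent scalar altogether; a minimal sketch of the distinction (illustrative only):

```python
import torch

# Dynamo's suggested workaround from the hint above: trace data-dependent
# scalar outputs symbolically instead of breaking the graph.
torch._dynamo.config.capture_scalar_outputs = True

# The PR takes the other route: the implicit .item() on a CUDA tensor is
# removed, so neither the graph break nor the host/device sync happens.
```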

To ensure that this code path is taken, I also put a breakpoint before that line in the uncompiled version and ran it in debugging mode. I can confirm that it indeed executes that line.

Let me know if you want me to check for anything else!

@ArthurZucker (Collaborator) commented:

Sounds great!

@ArthurZucker marked this pull request as ready for review on November 29, 2025 at 16:37
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Dhruv88 force-pushed the avoid_gpu_cpu_sync branch 2 times, most recently from b770b2f to 7113d1c on December 1, 2025 at 09:08
@Dhruv88 (Contributor, Author) commented Dec 1, 2025

@ArthurZucker @Rocketknight1 I rebased a couple of times, but test_torch and test_tokenization always seem to fail due to a worker crash, and test_processor due to a timeout. Maybe just rerunning the failed tests would help, but I'm not sure how to do that. How can I understand the issue here?

@ArthurZucker (Collaborator) commented Dec 1, 2025

Don't worry, these are unrelated!

@Dhruv88 force-pushed the avoid_gpu_cpu_sync branch from 7113d1c to 780a281 on December 1, 2025 at 12:27
@github-actions (Bot) commented Dec 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, arcee, aria, bitnet, cohere, csm, cwm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium

@ArthurZucker merged commit a3881a8 into huggingface:main on Dec 1, 2025 (16 of 21 checks passed)
@ArthurZucker (Collaborator) commented:

Thanks for the PR!

sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025
…gingface#42433)

* perf: Remove implicit CPU-GPU syncs due to implicit .item() call

* fix: replicated the changes across similar files

* fix: update the newly added nanochat model files

* fix: use input_shape and device instead of input_emdeds properties for imagegpt
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
…gingface#42433)

* perf: Remove implicit CPU-GPU syncs due to implicit .item() call

* fix: replicated the changes across similar files

* fix: update the newly added nanochat model files

* fix: use input_shape and device instead of input_emdeds properties for imagegpt
Linked issue: Performance Improvement: Avoid host↔device synchronizations caused by tensor-to-Python conversions and certain tensor ops
