perf: Remove implicit CPU-GPU syncs due to implicit .item() call #42433
ArthurZucker merged 4 commits into huggingface:main
Conversation
Yes, those CI errors are not your fault! However, the
Force-pushed 9b1fbe0 to b7dc156
@Rocketknight1 Among the tests that failed, two of them look to be due to a worker timeout or crash, so rerunning should work. The third one is for

Anyway, after some investigation, here is my understanding and a possible fix for the issue. The test sets

Please let me know if my understanding is right. If yes, I can try to make a quick fix by ensuring the correct key is set.
ArthurZucker
left a comment
There was a problem hiding this comment.
Very nice! Can you make sure compile with reduce-overhead still works as expected? That's my only concern; otherwise LGTM!
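A minimal way to exercise that path (a sketch only; `greedy_step` is a hypothetical stand-in for a decode step, not code from this PR):

```python
import torch

def greedy_step(logits):
    # Pure device-side op: no implicit .item(), so nothing here forces
    # the host sync that would interfere with CUDA-graph replay under
    # mode="reduce-overhead".
    return logits.argmax(dim=-1)

# torch.compile is lazy: the function is only compiled on first call.
# mode="reduce-overhead" uses CUDA graphs on GPU, which is where an
# accidental CPU-GPU sync would be most visible.
compiled_step = torch.compile(greedy_step, mode="reduce-overhead")
```

Calling `compiled_step` in a loop on GPU and checking that throughput does not regress is one way to confirm the sync removal plays well with reduce-overhead.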
The ColQwen2 issue should be fixed on main too, so a rebase should make that go away.
Force-pushed b7dc156 to e55398d
@ArthurZucker I tried the minimal code below, taken from the official documentation here. Note that the

With the updated code it runs and generates the output below:

However, with the earlier code, torch.compile fails with a graph break:

To ensure that this code path is taken, I also put a breakpoint before the line in the uncompiled version and ran it in debugging mode. I can confirm that it indeed executes that line. Let me know if you want me to check for anything else!
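For readers following along, the kind of graph break being discussed can be sketched like this (a minimal illustration, not the PR's actual code; `f_breaks` and `f_clean` are hypothetical names):

```python
import torch

def f_breaks(x):
    # `if` on a 0-dim tensor calls bool() -> an implicit .item(),
    # which forces a host sync and breaks the compiled graph.
    if x.sum() > 0:
        return x + 1
    return x - 1

def f_clean(x):
    # torch.where keeps the branch on device: no .item(), no sync,
    # and torch.compile(fullgraph=True) can capture it in one graph.
    return torch.where(x.sum() > 0, x + 1, x - 1)

# Compilation is deferred until the first call; with fullgraph=True,
# compiling f_breaks would raise on the data-dependent branch, while
# f_clean compiles as a single graph.
compiled_clean = torch.compile(f_clean, fullgraph=True)
```

Eagerly, both functions return the same values; only `f_clean` survives full-graph compilation.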
Sounds great!
Force-pushed b770b2f to 7113d1c
@ArthurZucker @Rocketknight1 I rebased a couple of times, but test_torch and test_tokenization always seem to fail due to a worker crash, and test_processor due to a timeout. Maybe just rerunning the failed tests would help, but I am not sure how to do that. How can I investigate the issue here?
Don't worry, these are unrelated!
Force-pushed 7113d1c to 780a281
[For maintainers] Suggested jobs to run (before merge) run-slow: apertus, arcee, aria, bitnet, cohere, csm, cwm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium |
Thanks for the PR! |
…gingface#42433)
* perf: Remove implicit CPU-GPU syncs due to implicit .item() call
* fix: replicated the changes across similar files
* fix: update the newly added nanochat model files
* fix: use input_shape and device instead of input_emdeds properties for imagegpt
What does this PR do?
Remove unnecessary cudaStreamSynchronize calls caused by implicit calls to .item().
The motivation is described in the issue tagged below.
Fixes #42422
Made changes as proposed in the issue.
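As a concrete illustration of the pattern (a hedged sketch; `position_ids_sync` and `position_ids_async` are hypothetical names, not the PR's actual diff):

```python
import torch

def position_ids_sync(past_length, seq_len, device):
    # int() on a 0-dim tensor is an implicit .item(): the value is
    # copied to the host, which on GPU forces a cudaStreamSynchronize.
    start = int(past_length)
    return torch.arange(start, start + seq_len, device=device)

def position_ids_async(past_length, seq_len, device):
    # Build the range from the Python int seq_len and shift it with a
    # tensor add, so past_length never round-trips to the CPU.
    return torch.arange(seq_len, device=device) + past_length
```

Both return the same values; only the second keeps the computation entirely on device when `past_length` lives on the GPU.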
Heads up
When I ran the integration tests, two of them failed:
test_llama_3_1_hard
test_model_7b_logits_bf16
In both cases the actual output did not match the expected output. I am using an A100 GPU.
In the first case, the actual output matched the expected ROCm output rather than the CUDA one.
In the second case it was:
tensor([[-6.5081, -4.1175, -4.9761, -3.1678, 0.8199, -3.0029, 1.2809, -3.3309]]), which again differs slightly from the expected CUDA outputs.
However, these tests fail even without my change, and the output is the same with and without it, so I believe my changes are not causing the failures.
One more point: similar code appears across multiple files, so if this change matches expectations I can make the same change in the other files as well. Please let me know how I should proceed.
@Rocketknight1 @ArthurZucker @Cyrilvallez