
Fix many HPU failures in the CI #39066

Merged

SunMarc merged 14 commits into main from fix-hpu-errors on Jul 3, 2025

Conversation

@IlyasMoutawwakil
Member

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as ready for review June 27, 2025 15:36
@IlyasMoutawwakil IlyasMoutawwakil requested a review from ydshieh June 27, 2025 15:36
Comment thread utils/notification_service.py Outdated
Comment thread utils/notification_service.py
Comment thread tests/trainer/test_trainer.py
Comment thread src/transformers/utils/import_utils.py
Comment thread src/transformers/trainer.py Outdated
Collaborator

@ydshieh ydshieh left a comment

Works for me, but I don't have enough context to judge the changes in trainer.py.

Collaborator

@ArthurZucker ArthurZucker left a comment

Happy to remove the delayed optimizer creation if we are sure this:

  • fixes the failures
  • does not introduce breaking changes

Comment thread src/transformers/trainer.py Outdated
@IlyasMoutawwakil
Member Author

@ArthurZucker removed the FSDP fix in favor of #39152, as it makes more sense to only prepare the model rather than the optimizer.
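
For context on the prepare-only-the-model approach: with FSDP, the optimizer should be built from the *wrapped* model's parameters, so the model is prepared first and the optimizer created afterwards. A minimal sketch using the Accelerate API (not the exact Trainer change from the PR):

```python
import torch
from accelerate import Accelerator

# Assumes FSDP was enabled via `accelerate config`; without it,
# prepare() is a pass-through and the ordering below is simply harmless.
accelerator = Accelerator()
model = torch.nn.Linear(8, 8)

# Prepare only the model first: under FSDP this wraps and shards it.
model = accelerator.prepare(model)

# Build the optimizer from the wrapped model's (flattened) parameters,
# then prepare it separately.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
optimizer = accelerator.prepare(optimizer)
```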

@IlyasMoutawwakil
Member Author

Added a tracker for the HPU patches in #39175.

Member

@SunMarc SunMarc left a comment

Thanks!

@SunMarc SunMarc merged commit 18e0cae into main Jul 3, 2025
26 checks passed
@SunMarc SunMarc deleted the fix-hpu-errors branch July 3, 2025 09:17
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025

* more torch.hpu patches (see the patching sketch after this list)

* increase top_k because it results in flaky behavior when Temperature, TopP, and TopK are used together, which ends up killing beams early (see the sampling sketch after this list)

* remove temporary fix

* fix scatter operation when input and src are the same (see the scatter sketch after this list)

* trigger

* fix and reduce

* skip finding batch size as it makes the HPU go loco

* fix FSDP (yay, all are passing)

* fix checking equal NaN values (see the comparison sketch after this list)

* style

* remove models list

* order

* rename to cuda_extensions

* Update src/transformers/trainer.py
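
On the "torch.hpu patches" bullet: the pattern is to fill in device-API gaps so device-agnostic code can call torch.hpu entry points unconditionally. A minimal sketch of that pattern; the attribute name below is illustrative, not necessarily one patched by this PR:

```python
import torch

# Illustrative only: give device-agnostic code a no-op fallback when a
# torch.hpu entry point is missing from the installed stack. The guard
# makes this safe on machines without HPU support at all.
if hasattr(torch, "hpu") and not hasattr(torch.hpu, "empty_cache"):
    torch.hpu.empty_cache = lambda: None
```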
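On the top_k bullet: when temperature, top-p, and top-k are combined under beam sampling, an aggressive top_k can leave too few live candidates, which ends beams early. A minimal sketch with the transformers generate API; the model and parameter values are illustrative, not those used in the CI:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any small causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The CI is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,      # sampling + num_beams > 1 => beam sampling
    num_beams=4,
    temperature=0.7,
    top_p=0.9,
    top_k=50,            # a larger top_k reduces the chance of starving beams
    max_new_tokens=20,
)
print(tok.decode(out[0], skip_special_tokens=True))
```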
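On the scatter bullet: when the source tensor aliases the destination, an in-place scatter may read values it has already overwritten, and the result can differ across backends. A small sketch of the pitfall and the safe variant (the clone-based workaround is an assumption about the fix, not a quote from the diff):

```python
import torch

x = torch.arange(6.0)
index = torch.tensor([5, 4, 3, 2, 1, 0])

# Potentially ill-defined if the backend reads and writes x in place:
# x.scatter_(0, index, x)

# Safe version: scatter from an independent copy of the source.
x.scatter_(0, index, x.clone())
print(x)  # tensor([5., 4., 3., 2., 1., 0.])
```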
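On the NaN-comparison bullet: NaN never compares equal to NaN, so plain tensor equality fails whenever both sides contain NaN in the same positions. A minimal sketch of the distinction:

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([1.0, float("nan")])

print(torch.equal(a, b))                     # False: NaN != NaN element-wise
print(torch.allclose(a, b, equal_nan=True))  # True: NaNs treated as equal
```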