
Fix many HPU failures in the CI #39066

Merged

SunMarc merged 14 commits into main from fix-hpu-errors on Jul 3, 2025

Conversation

@IlyasMoutawwakil
Member

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as ready for review June 27, 2025 15:36
@IlyasMoutawwakil IlyasMoutawwakil requested a review from ydshieh June 27, 2025 15:36
Comment thread utils/notification_service.py Outdated
Comment thread utils/notification_service.py
Comment thread tests/trainer/test_trainer.py
Comment thread src/transformers/utils/import_utils.py
Comment thread src/transformers/trainer.py Outdated
Collaborator

@ydshieh ydshieh left a comment

Works for me, but I don't have enough context to judge the changes in trainer.py.

Collaborator

@ArthurZucker ArthurZucker left a comment

Happy to remove the delayed optimizer creation if we are sure this:

  • fixes the failures
  • does not introduce breaking changes

Comment thread src/transformers/trainer.py Outdated
@IlyasMoutawwakil
Member Author

@ArthurZucker removed the FSDP fix in favor of #39152, as it makes more sense to only prepare the model rather than the optimizer.
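
For context on the prepare-only-the-model approach: with FSDP, the optimizer should be built from the *wrapped* model's parameters, so the model is prepared first and the optimizer created afterwards. A minimal sketch using the Accelerate API (not the exact Trainer change from the PR):

```python
import torch
from accelerate import Accelerator

# Assumes FSDP was enabled via `accelerate config`; without it,
# prepare() is a pass-through and the ordering below is simply harmless.
accelerator = Accelerator()
model = torch.nn.Linear(8, 8)

# Prepare only the model first: under FSDP this wraps and shards it.
model = accelerator.prepare(model)

# Build the optimizer from the wrapped model's (flattened) parameters,
# then prepare it separately.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
optimizer = accelerator.prepare(optimizer)
```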

@IlyasMoutawwakil
Member Author

Added a tracker for the HPU patches in #39175.

Member

@SunMarc SunMarc left a comment

Thanks!

@SunMarc SunMarc merged commit 18e0cae into main Jul 3, 2025
26 checks passed
@SunMarc SunMarc deleted the fix-hpu-errors branch July 3, 2025 09:17
zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025

* more torch.hpu patches (see the patching sketch after this list)

* increase top_k because it results in flaky behavior when Temperature, TopP, and TopK are used together, which ends up killing beams early (see the sampling sketch after this list)

* remove temporary fix

* fix scatter operation when input and src are the same (see the scatter sketch after this list)

* trigger

* fix and reduce

* skip finding batch size as it makes the HPU go loco

* fix FSDP (yay, all are passing)

* fix checking equal NaN values (see the comparison sketch after this list)

* style

* remove models list

* order

* rename to cuda_extensions

* Update src/transformers/trainer.py
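
On the "torch.hpu patches" bullet: the pattern is to fill in device-API gaps so device-agnostic code can call torch.hpu entry points unconditionally. A minimal sketch of that pattern; the attribute name below is illustrative, not necessarily one patched by this PR:

```python
import torch

# Illustrative only: give device-agnostic code a no-op fallback when a
# torch.hpu entry point is missing from the installed stack. The guard
# makes this safe on machines without HPU support at all.
if hasattr(torch, "hpu") and not hasattr(torch.hpu, "empty_cache"):
    torch.hpu.empty_cache = lambda: None
```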
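On the top_k bullet: when temperature, top-p, and top-k are combined under beam sampling, an aggressive top_k can leave too few live candidates, which ends beams early. A minimal sketch with the transformers generate API; the model and parameter values are illustrative, not those used in the CI:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any small causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The CI is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,      # sampling + num_beams > 1 => beam sampling
    num_beams=4,
    temperature=0.7,
    top_p=0.9,
    top_k=50,            # a larger top_k reduces the chance of starving beams
    max_new_tokens=20,
)
print(tok.decode(out[0], skip_special_tokens=True))
```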
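On the scatter bullet: when the source tensor aliases the destination, an in-place scatter may read values it has already overwritten, and the result can differ across backends. A small sketch of the pitfall and the safe variant (the clone-based workaround is an assumption about the fix, not a quote from the diff):

```python
import torch

x = torch.arange(6.0)
index = torch.tensor([5, 4, 3, 2, 1, 0])

# Potentially ill-defined if the backend reads and writes x in place:
# x.scatter_(0, index, x)

# Safe version: scatter from an independent copy of the source.
x.scatter_(0, index, x.clone())
print(x)  # tensor([5., 4., 3., 2., 1., 0.])
```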
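On the NaN-comparison bullet: NaN never compares equal to NaN, so plain tensor equality fails whenever both sides contain NaN in the same positions. A minimal sketch of the distinction:

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([1.0, float("nan")])

print(torch.equal(a, b))                     # False: NaN != NaN element-wise
print(torch.allclose(a, b, equal_nan=True))  # True: NaNs treated as equal
```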