HPU support #36424
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ArthurZucker @muellerzr PR is ready for review. I made sure the (trainer, fsdp, deepspeed) tests ran successfully on both Gaudi1 and Gaudi2 in single- and multi-device settings.
```python
# the file doesn't exist in the repo
if not os.path.exists("utils/testing_scripts/fsdp_cpu_offloading.py"):
    raise unittest.SkipTest("FSDP CPU offloading script not found!")
```
Couldn't find this file; is this test still relevant?
I think it's meant to be:

```python
from functools import partial

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

# verify we have FSDP activation support ready by importing:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
    CheckpointImpl,
    apply_activation_checkpointing,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model_id = "HuggingFaceM4/tiny-random-Llama3ForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()
model.gradient_checkpointing_enable()

accelerator = Accelerator()
model = accelerator.prepare(model)

check_fn = lambda submodule: isinstance(submodule, LlamaDecoderLayer)
non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
apply_activation_checkpointing(
    model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
)

print(model)
rand_input = torch.LongTensor([[0, 1, 0, 1]]).to(0)
model(rand_input)
```

Was referenced in #31161 but never actually added? 😅
Should I leave it for another PR? The file path `utils/testing_scripts/fsdp_cpu_offloading.py` doesn't make sense in the transformers repo.
ArthurZucker
left a comment
Nice! Still missing for me is a bit of doc on:
- what HPU is
- how anyone could run on HPU

But that's it!
muellerzr
left a comment
Thanks! Added a note for our apparent missing test file 👀
muellerzr
left a comment
Everything looks good from the Trainer side in my eyes. The only thing we may want is to add an accelerate import check to flag it as a requirement (the release will go live tonight).
Added! The target version is 1.50, right? @muellerzr
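For reference, a minimal sketch of the kind of version gate being discussed (the helper name and the naive numeric comparison are illustrative assumptions, not the PR's code; transformers has its own version-checking utilities):

```python
import importlib.metadata


def meets_min_version(package: str, minimum: str) -> bool:
    """Return True iff `package` is installed at version >= `minimum`.

    Naive numeric comparison for illustration only; a real check would
    use packaging.version to handle pre-releases and dev builds.
    """
    try:
        installed = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Package not installed at all: the gate fails.
        return False

    def to_tuple(v: str) -> tuple:
        # "1.5.0" -> (1, 5, 0); non-numeric parts are skipped.
        return tuple(int(part) for part in v.split(".") if part.isdigit())

    return to_tuple(installed) >= to_tuple(minimum)


# A missing package simply fails the gate:
print(meets_min_version("definitely-not-installed-xyz", "1.0.0"))  # False
```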
What does this PR do?
This PR introduces upstream support for the HPU torch device/backend.
This PR focuses on enabling out-of-the-box support in eager mode (`PT_HPU_LAZY_MODE=0`), while `optimum-habana` will continue to enable optimized paths making use of the lazy mode and advanced features of the SynapseAI software stack.

This is part of three PRs:
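As a rough illustration of what "out of the box" means here, a sketch under two assumptions not stated in this PR: the Gaudi software stack registers an `hpu` torch backend, and `PT_HPU_LAZY_MODE=0` should be set before the plugin loads to select eager mode. On a machine without HPU hardware this simply falls back to CUDA or CPU:

```python
import os

# Select eager mode before the Habana torch plugin loads (assumption,
# based on the PT_HPU_LAZY_MODE=0 flag described above).
os.environ.setdefault("PT_HPU_LAZY_MODE", "0")

import torch


def pick_device() -> torch.device:
    # Prefer HPU when the Gaudi plugin has registered the backend, then
    # fall back to CUDA, then CPU. `torch.hpu` only exists once the
    # Habana integration is available, hence the hasattr guard.
    if hasattr(torch, "hpu") and torch.hpu.is_available():
        return torch.device("hpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()
x = torch.ones(2, 2, device=device)
print(device.type, (x + x).sum().item())
```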
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.