
Fixing RotaryEmbedding.forward to return float16 values in float16 precision mode.#24262

Closed
kikutakou wants to merge 2 commits into huggingface:main from kikutakou:ko_gptneox_fp16_fix

Conversation

@kikutakou

What does this PR do?

RotaryEmbedding.forward() returns values in float32 precision even in float16 precision mode.
This affects the subsequent calculations and increases GPU memory usage.
This PR fixes that problem.

Fixes #24261
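For illustration, a minimal way to observe the problem (a sketch; the checkpoint name and module path are assumptions, not part of the PR, and attribute names follow the GPT-NeoX modeling code of this transformers version):

import torch
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m", torch_dtype=torch.float16)
rotary = model.gpt_neox.layers[0].attention.rotary_emb  # assumed module path
print(rotary.cos_cached.dtype)  # float32 before this fix, despite torch_dtype=torch.float16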

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Contributor

@amyeroberts left a comment


Thanks for fixing!

Just a comment on the creation of the embeddings.

Contributor


I think we might also want to control the type when creating the weights e.g. like here for Llama

cc @younesbelkada who knows more about this

Comment on lines 278 to 279
Contributor


nit: you can set dtype and device in one to call

Suggested change
-   cos = self.cos_cached[:seq_len, ...].to(x.device).to(x.dtype)
-   sin = self.sin_cached[:seq_len, ...].to(x.device).to(x.dtype)
+   cos = self.cos_cached[:seq_len, ...].to(x.device, dtype=x.dtype)
+   sin = self.sin_cached[:seq_len, ...].to(x.device, dtype=x.dtype)
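For reference, a quick check (plain PyTorch, nothing PR-specific) that a single .to() call can set device and dtype together:

import torch

t = torch.zeros(2)
print(t.to("cpu", dtype=torch.float16).dtype)  # torch.float16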

Author


Thanks for the comment! It's reflected in the patch!

Collaborator

@ArthurZucker left a comment


As mentioned by @amyeroberts, I believe the issue is rather with the initialization, since calling model.half() will probably do the operation in float16. This means that the initialization with torch.float16 as an argument of from_pretrained is not really doing its job. I would be more in favor of fixing the init rather than changing the forward!
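A rough sketch of what an init-side fix could look like (an assumption-laden illustration, not the merged change): keep the cos/sin computation in float32, then store the caches in the dtype that is active at construction time.

import torch

class GPTNeoXRotaryEmbeddingSketch(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings, base=10000, device=None):
        super().__init__()
        # Under from_pretrained(torch_dtype=torch.float16) the default dtype is float16 here.
        target_dtype = torch.get_default_dtype()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        # Compute the caches in float32 (half cos/sin may be unsupported on CPU), then cast.
        t = torch.arange(max_position_embeddings, device=device, dtype=torch.float32)
        freqs = torch.einsum("i,j->ij", t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(target_dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(target_dtype), persistent=False)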

@huggingface deleted a comment from the github-actions bot on Jul 20, 2023
@ArthurZucker
Collaborator

I will investigate whether or not this is the source of instabilities in Llama2! If so, I will address it.

@huggingface deleted a comment from the github-actions bot on Aug 16, 2023
@ArthurZucker
Collaborator

No time to deep dive into this at the moment! If someone wants to check this feel free to do so! 😉

@kikutakou force-pushed the ko_gptneox_fp16_fix branch from 25a6a94 to aced0ab on August 18, 2023 04:03
@kikutakou force-pushed the ko_gptneox_fp16_fix branch from aced0ab to 3b4944b on August 23, 2023 02:34
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
Author


float() always converts a tensor to float32. This is why initialisation with dtype=float16 didn't work.
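A tiny demonstration of that behaviour (plain PyTorch, independent of this PR):

import torch

torch.set_default_dtype(torch.float16)
x = torch.arange(0, 8, 2)                      # integer tensor
print(x.float().dtype)                         # torch.float32, ignores the default dtype
print(x.to(torch.get_default_dtype()).dtype)   # torch.float16
torch.set_default_dtype(torch.float32)         # restore the usual default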

def _set_cos_sin_cache(self, seq_len, device):
    self.max_seq_len_cached = seq_len
-   t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+   t = torch.arange(self.max_seq_len_cached, device=device).float()
Author


Since emb.cos() and emb.sin() at line 314 can only be calculated in float32 on CPU, this variable must be float32.

If this t is float16 and emb.cos() is calculated on CPU, the following error will be raised:

RuntimeError: "cos_vml_cpu" not implemented for 'Half'
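Minimal repro of that error (whether it actually raises depends on the PyTorch build; older CPU builds lack half-precision cos):

import torch

t = torch.arange(4, dtype=torch.float16)  # half-precision tensor on CPU
try:
    print(t.cos())
except RuntimeError as err:
    print(err)  # e.g. "cos_vml_cpu" not implemented for 'Half'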

@kikutakou
Author

@ArthurZucker

Thanks for the comment.

The initialization with torch.float16 as an argument of from_pretrained is not really doing its job.

I've investigated and changed the patch to fix this issue.
Could you have a look at this patch?

from_pretrained changes the torch default dtype to the specified dtype and then initializes all weights.
GPTNeoXRotaryEmbedding.__init__() calls float(), which always returns float32 even when the default dtype is float16.
This was the reason.
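A simplified illustration of that mechanism (not the actual transformers implementation; the helper name is made up):

import torch

def build_under_dtype(module_cls, dtype, *args, **kwargs):
    # from_pretrained temporarily switches the default dtype while the model is built,
    # so any submodule that hard-codes .float() escapes the requested precision.
    previous = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        return module_cls(*args, **kwargs)
    finally:
        torch.set_default_dtype(previous)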

@ArthurZucker
Collaborator

This was actually fixed by #25830!

@github-actions
Contributor

github-actions Bot commented Nov 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions Bot closed this Nov 17, 2023


Development

Successfully merging this pull request may close these issues.

GPTNeoXAttention takes extra GPU memory footprint in torch.float16 precision mode.

3 participants