
fix static cache data type miss-match #34799
Merged
ArthurZucker merged 14 commits into huggingface:main from jiqing-feng:gptj on Nov 25, 2024

Conversation

@jiqing-feng (Contributor) commented Nov 19, 2024

Hi @SunMarc. This PR fixes the data type mismatch that occurs when using a low-precision static cache. The following code reproduces the bug:

import torch
from transformers import pipeline

model_id = "EleutherAI/gpt-j-6b"
model_kwargs = {"torch_dtype": torch.bfloat16}

pipe = pipeline("text-generation", model=model_id, model_kwargs=model_kwargs)

generation_config = pipe.model.generation_config
generation_config.cache_implementation="static"

print(pipe("I am happy because", generation_config=generation_config))

Output:

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Traceback (most recent call last):
  File "/home/jiqingfe/test_gptj.py", line 13, in <module>
    print(pipe("I am happy because", generation_config=generation_config))
  File "/home/jiqingfe/transformers/src/transformers/pipelines/text_generation.py", line 272, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/pipelines/base.py", line 1301, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/jiqingfe/transformers/src/transformers/pipelines/base.py", line 1308, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/jiqingfe/transformers/src/transformers/pipelines/base.py", line 1208, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/jiqingfe/transformers/src/transformers/pipelines/text_generation.py", line 370, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/generation/utils.py", line 2263, in generate
    result = self._beam_search(
  File "/home/jiqingfe/transformers/src/transformers/generation/utils.py", line 3472, in _beam_search
    outputs = self(**model_inputs, return_dict=True)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/models/gptj/modeling_gptj.py", line 1098, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/models/gptj/modeling_gptj.py", line 838, in forward
    outputs = block(
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/models/gptj/modeling_gptj.py", line 453, in forward
    attn_outputs = self.attn(
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/idp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiqingfe/transformers/src/transformers/models/gptj/modeling_gptj.py", line 246, in forward
    key, value = layer_past.update(key, value, self.layer_idx, cache_kwargs)
  File "/home/jiqingfe/transformers/src/transformers/cache_utils.py", line 1220, in update
    k_out.index_copy_(2, cache_position, key_states)
RuntimeError: index_copy_(): self and source expected to have the same dtype, but got (self) BFloat16 and (source) Float

@jiqing-feng (Contributor, Author):

BTW, I suppose transformers is missing some static cache tests. Do you have any guidance on where I can add this kind of test? Thanks!

@SunMarc (Member) commented Nov 19, 2024

BTW, I suppose transformers is missing some static cache tests. Do you have any guidance on where I can add this kind of test? Thanks!

All the tests related to the cache are in the test_utils.py file. Inside GenerationTesterMixin you will find the tests we run on all models, and GenerationIntegrationTests contains the integration tests.
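
For reference, a minimal standalone check along these lines would exercise the failing path. This is only an illustrative sketch, not the actual test added to GenerationTesterMixin; the model id, prompt, and token count are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_low_precision_static_cache():
    # Illustrative sketch only; the real coverage lives in GenerationTesterMixin.
    model_id = "EleutherAI/gpt-j-6b"  # any causal LM that supports the static cache
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    inputs = tokenizer("I am happy because", return_tensors="pt")
    # Before this fix, a bf16 model with a static cache failed with
    # "index_copy_(): self and source expected to have the same dtype".
    output = model.generate(**inputs, cache_implementation="static", max_new_tokens=8)
    assert output.shape[-1] > inputs["input_ids"].shape[-1]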

@SunMarc (Member) left a comment


Thanks for the bug fix! Left a comment.

Comment on lines +236 to +237
key = key.permute(0, 2, 1, 3).to(value.dtype)
query = query.permute(0, 2, 1, 3).to(value.dtype)
@SunMarc (Member):

Could you explain why this is needed for this particular model, and why this doesn't happen for llama, for example? Many models have approximately the same modeling code.

@jiqing-feng (Contributor, Author):

For llama, sin and cos come from position_embeddings (a bf16 tensor), which comes from here; you can see that llama's rotary embedding converts the data type. But for gptj, the position embeddings come from here, and they are created in float32, so a data type mismatch happens when the input data type is bf16 or fp16.
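
As an illustration of the root cause (the tensor names and shapes below are made up for the example, not GPT-J's actual code): PyTorch type promotion upcasts a bf16 tensor multiplied by a float32 tensor to float32, so key states rotated with float32 sin/cos tables no longer match the bf16 static cache buffers.

import torch

key = torch.randn(1, 8, 4, 16, dtype=torch.bfloat16)  # hypothetical bf16 key states
sin = torch.randn(4, 16, dtype=torch.float32)          # float32 positional table
rotated = key * sin
print(rotated.dtype)  # torch.float32 -> mismatches a bf16 cache buffer in index_copy_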

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng changed the title from "fix gptj data type missmatch" to "fix static cache data type missmatch" on Nov 20, 2024
@jiqing-feng changed the title from "fix static cache data type missmatch" to "fix static cache data type miss-match" on Nov 20, 2024
@jiqing-feng (Contributor, Author):

Hi @SunMarc. I left a comment explaining why the llama model doesn't have this issue. BTW, I also added low-precision static cache tests to prevent this kind of issue in the future; please review them. Thanks!

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@SunMarc (Member) left a comment


Thanks for the explanation! LGTM! Did you run the static cache tests you added to see if there are other models that require this fix?

@SunMarc SunMarc requested a review from ArthurZucker November 20, 2024 12:30
@ArthurZucker (Collaborator) left a comment


➕ on Marc's comment. The safe way to do this is to cast key and query to the cache's dtype, no? And do this in cache_utils rather than at the modeling level!

@jiqing-feng (Contributor, Author) commented Nov 21, 2024

➕ on Marc's comment. The safe way to do this is to cast key and query to the cache's dtype, no? And do this in cache_utils rather than at the modeling level!

Yes, I have applied your suggestions, thanks!

Thanks for the explanation! LGTM! Did you run the static cache tests you added to see if there are other models that require this fix?

The CI already runs the tests I changed, so currently no other models require this fix. Besides, I have moved the change into cache_utils, so it applies to all language models that use the static cache.
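
A minimal sketch of what a cast at the cache level can look like, assuming StaticCache-style preallocated key_cache/value_cache buffers; this is a simplified illustration, not the exact diff merged in this PR:

def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
    # Simplified StaticCache.update-style method (illustrative only).
    cache_position = cache_kwargs.get("cache_position")
    k_out = self.key_cache[layer_idx]
    v_out = self.value_cache[layer_idx]

    # Cast the incoming states to the buffers' dtype so index_copy_ sees
    # matching dtypes, even if the model produced float32 key/value states.
    key_states = key_states.to(k_out.dtype)
    value_states = value_states.to(v_out.dtype)

    k_out.index_copy_(2, cache_position, key_states)
    v_out.index_copy_(2, cache_position, value_states)
    return k_out, v_out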

(Outdated review thread on src/transformers/cache_utils.py)
@HuggingFaceDocBuilderDev:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@SunMarc SunMarc requested a review from ArthurZucker November 22, 2024 16:17
@jiqing-feng (Contributor, Author):

Hi @SunMarc, please review the new changes, thanks!

@SunMarc (Member) commented Nov 25, 2024

Hi @SunMarc, please review the new changes, thanks!

All good from my side. Pinging @ArthurZucker

@ArthurZucker (Collaborator) left a comment


Not completely sure we want to test for float32 as it's quite heavy

@SunMarc (Member) commented Nov 25, 2024

I think it was testing for float32 initially, and @jiqing-feng added coverage for float16, @ArthurZucker.

@ArthurZucker (Collaborator):

Sounds good then, merging!

@ArthurZucker ArthurZucker merged commit a464afb into huggingface:main Nov 25, 2024
@jiqing-feng jiqing-feng deleted the gptj branch November 26, 2024 01:06
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* fix gptj data type missmatch

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* add low precision static cache tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix low-precision static cache tests

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* avoid config change

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* change data type convert in cache copy

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix comment

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* cast key value after k v out

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>