
Fix: Enable prefill phase key value caching of nemotron/minitron models #34742

Merged
zucchini-nlp merged 9 commits into huggingface:main from jeongin601:main
Nov 25, 2024

Conversation

@jeongin601
Contributor

@jeongin601 commented Nov 15, 2024

What does this PR do?

Fixes #34739

Problem

The current implementation does not enable key-value caching for the Nemotron and Minitron models.
This can be verified with the short example below, which inspects the key and value caches returned by a forward pass.

Modification

I modified the code to enable key-value caching during the prefill phase, following the approach used in the `modeling_llama.py` file.
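
Roughly, the change amounts to something like the sketch below (illustrative only, modeled on the cache handling in `modeling_llama.py`; the exact conditional in the merged diff may differ). It makes the prefill forward pass create a cache object when caching is requested but none was passed, so the keys and values computed during prefill are actually stored and returned:

from transformers.cache_utils import DynamicCache

# Sketch, not the exact merged code: inside the model's forward pass,
# initialize an empty cache for the prefill phase when the caller asked
# for caching but did not provide a cache object.
if use_cache and past_key_values is None:
    past_key_values = DynamicCache()
# ...each decoder layer then writes its keys/values into `past_key_values`,
# and the populated cache is returned in the model output.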

Key-value caching example code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load Minitron model and tokenizer from Hugging Face
model_name = "your-minitron-model-name"  # Replace with the actual Minitron model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()

# Sample input text
input_text = "Hello, how are you?"

# Tokenize the input
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# First forward pass (prefill phase)
with torch.no_grad():
    outputs = model(input_ids, use_cache=True)  # Set use_cache=True
    logits = outputs.logits
    past_key_values = outputs.past_key_values

# Check the output
print("Logits shape:", logits.shape)
print("Number of layers in past_key_values:", len(past_key_values))
print("Shape of keys and values in the first layer:")
print("Key shape:", past_key_values[0][0].shape)
print("Value shape:", past_key_values[0][1].shape)

# Add new input to test cache utilization
new_input_text = " What about you?"
new_input_ids = tokenizer(new_input_text, return_tensors="pt").input_ids

# Pass the new input along with the previous key-value cache
with torch.no_grad():
    outputs_with_cache = model(new_input_ids, past_key_values=past_key_values, use_cache=True)

# Check results after caching
new_logits = outputs_with_cache.logits
new_past_key_values = outputs_with_cache.past_key_values

print("New logits shape:", new_logits.shape)
print("Number of layers in new past_key_values:", len(new_past_key_values))

As-Is Result

[Screenshot 2024-11-15 1:15:33 PM: example output before the fix]

Key-value caching is not performed.

To-be Result

[Screenshot 2024-11-15 4:15:44 PM: example output after the fix]

Key-value caching is enabled.
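
As an extra sanity check (a hedged continuation of the example above, reusing its variables and assuming the usual `[batch, num_heads, seq_len, head_dim]` layout of the cached key tensors), the cache returned after the second pass should hold the prefill tokens plus the newly processed ones:

# Continuation of the example above (sketch): the total cached length should
# equal the prefill length plus the number of tokens in the second pass.
prefill_len = input_ids.shape[1]
total_len = new_past_key_values[0][0].shape[2]  # seq_len dimension of the first layer's key cache
assert total_len == prefill_len + new_input_ids.shape[1]
print(f"Cache holds {total_len} tokens ({prefill_len} of them from the prefill pass)")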

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker
Can you please check my modification? :)

@LysandreJik
Member

cc @ArthurZucker @gante @zucchini-nlp

Member

@zucchini-nlp left a comment

Thanks for adding this! Let's remove the deprecation warning, otherwise LGTM!

Comment on lines +791 to +795
logger.warning_once(
    "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
    "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
    "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
)
Member

I don't think we need to add a deprecation message for newly added models; we can support only the new `Cache` objects.

Contributor Author

Thanks for reviewing my code! I removed the deprecation warning. :)

Member

Sorry if I wasn't clear, I meant totally removing support for the tuple format, and thus the `from_legacy_cache` call.

I'll ping the core maintainer after that for the final review :)

Contributor Author

Oh, sorry, I got it wrong. I have now removed support for tuple-shaped `past_key_values`. Is this what you meant?
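
For anyone who was passing the legacy tuple-of-tuples format to these models, a minimal migration sketch (assuming `DynamicCache` is exported from the top-level `transformers` namespace, as in recent versions; `legacy_past_key_values` is a placeholder for a previously saved tuple cache):

from transformers import DynamicCache

# Sketch: convert a legacy tuple-of-tuples cache once, outside the model,
# and pass the resulting Cache object to the forward pass or generate().
cache = DynamicCache.from_legacy_cache(legacy_past_key_values)
outputs = model(new_input_ids, past_key_values=cache, use_cache=True)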

Collaborator

@ArthurZucker left a comment

One suggestion and we can merge!

Comment thread on src/transformers/models/nemotron/modeling_nemotron.py (outdated)
@ArthurZucker
Collaborator

BTW, this seems to be related to #34274.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jeongin601
Contributor Author

One suggestion and we can merge!

I updated it! :) Thanks

@zucchini-nlp merged commit 318fe25 into huggingface:main Nov 25, 2024
@ydshieh
Collaborator

ydshieh commented Nov 26, 2024

Hi @jeongin601, thank you for this PR ❤️ !

It seems this PR introduces some regressions: below are the 3 failing tests. Would you be up to take a look? You can find the job run log here.

In any case, thank you in advance.

        "nemotron": {
            "single-gpu": [
                {
                    "test": "tests/models/nemotron/test_modeling_nemotron.py::NemotronModelTest::test_torchscript_output_attentions",
                    "commit": "318fe25f22a99ce1226f8d2aadc268b40f7e55af",
                    "pr_number": 34742,
                    "author": "jeongin601",
                    "merged_by": "zucchini-nlp"
                },
                {
                    "test": "tests/models/nemotron/test_modeling_nemotron.py::NemotronModelTest::test_torchscript_output_hidden_state",
                    "commit": "318fe25f22a99ce1226f8d2aadc268b40f7e55af",
                    "pr_number": 34742,
                    "author": "jeongin601",
                    "merged_by": "zucchini-nlp"
                },
                {
                    "test": "tests/models/nemotron/test_modeling_nemotron.py::NemotronModelTest::test_torchscript_simple",
                    "commit": "318fe25f22a99ce1226f8d2aadc268b40f7e55af",
                    "pr_number": 34742,
                    "author": "jeongin601",
                    "merged_by": "zucchini-nlp"
                }
            ]
        }
    },

@ydshieh
Collaborator

ydshieh commented Nov 26, 2024

@ArthurZucker @zucchini-nlp A kind reminder: don't hesitate to ask for slow CI 🙂 - let's use the tools we have to make our lives easier 🙏

@ArthurZucker
Collaborator

Ah, it makes sense: TorchScript does not support the DynamicCache class!

@ArthurZucker
Collaborator

(AFAIR)

@ydshieh
Collaborator

ydshieh commented Nov 28, 2024

Ah ok, I will check then. But if we eventually move forward to DynamicCache and drop the legacy cache, would it mean TorchScript is not going to work for many models?

@ArthurZucker
Collaborator

Yeah 👀 unless used with optimum!
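
For readers hitting the TorchScript limitation discussed here, one workaround is to keep plain tensors at the boundary of the scripted/traced code and rebuild the cache object in eager code. A sketch, assuming the `to_legacy_cache()` / `from_legacy_cache()` helpers on `DynamicCache` remain available and that `outputs.past_key_values` is a `DynamicCache`:

from transformers import DynamicCache

# Sketch: export the cache as a tuple of (key, value) tensors per layer,
# which TorchScript-friendly code can handle, then rebuild the Cache object.
legacy = outputs.past_key_values.to_legacy_cache()
# ... hand `legacy` to scripted/traced code as plain tensors ...
cache = DynamicCache.from_legacy_cache(legacy)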

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…ls (huggingface#34742)

* modeling nemotron kv caching bugfix

Signed-off-by: jeongin601 <0200angela@gmail.com>

* test file deleted

Signed-off-by: jeongin601 <0200angela@gmail.com>

* code refinement

Signed-off-by: jeongin601 <0200angela@gmail.com>

* remove unused variables

Signed-off-by: jeongin601 <0200angela@gmail.com>

* import block sorted

* removed deprecation warning

Signed-off-by: jeongin601 <0200angela@gmail.com>

* removed support for tuple shape past_key_values

Signed-off-by: jeongin601 <0200angela@gmail.com>

* Update conditional statement for cache initialization

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Signed-off-by: jeongin601 <0200angela@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

BUG : Modeling nemotron file does not cache key values even though
