Generate: speculative decoding #27979
Conversation
Force-pushed from 18a4eda to 993c9ee
@patrickvonplaten tagging you here for a 2nd set of eyes on the speculative decoding method (changes in …)
These are not the best variable names, but it's hard to compare against the original algorithm if they don't match 🤔 As such, I've decided to keep the original names.
I'm fine with it as there are good comments and the other variables are well named, e.g. is_rejected :)
Thanks for adding this! Can we split this up into two separate PRs: one changing the assisted generation and the other adding speculative decoding?
@amyeroberts pulled the assisted generation changes into this PR: #28030. After it is merged, I will rebase this one and ping you again -- this one will become exclusively about speculative decoding 🤗
Force-pushed from 7bf05a9 to e234e1e
@amyeroberts I've rerun the slow tests, and I can confirm they are passing. Ready for a review :)
amyeroberts left a comment
Thanks for adding this!
Can we add some tests? In particular, one which checks case 1., and one which makes sure the correct logic branch is being selected, e.g. checking that candidate_logits is None when expected (might be a test on the candidate generator instead)?
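A hedged sketch of what such a branch-selection check could look like. `ToyCandidateGenerator` and `pick_branch` are hypothetical stand-ins for illustration, not the actual transformers test API or internals:

```python
# Toy sketch: a candidate generator that returns no logits should route
# assisted generation into the plain (non-speculative) verification branch.
# All names here are illustrative, not transformers APIs.
class ToyCandidateGenerator:
    def get_candidates(self, input_ids):
        # Candidates without associated logits (e.g. a lookup-based method).
        candidate_ids = input_ids + [7, 8]
        return candidate_ids, None  # candidate_logits is None

def pick_branch(candidate_logits, do_sample):
    # Speculative decoding needs candidate logits AND sampling enabled.
    if do_sample and candidate_logits is not None:
        return "speculative"
    return "plain"

ids, logits = ToyCandidateGenerator().get_candidates([1, 2, 3])
assert logits is None
assert pick_branch(logits, do_sample=True) == "plain"
assert pick_branch([[0.5, 0.5]], do_sample=True) == "speculative"
```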
```python
if do_sample:
    probs = new_logits.softmax(dim=-1)
    selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]
else:
    selected_tokens = new_logits.argmax(dim=-1)
```
It's probably time to factor this out into something like `selected_tokens = Categorical(new_logits / temperature).sample()` everywhere in generate
Yes! Then equivalent sampling/non-sampling methods (e.g. greedy decoding/sampling) could be merged into a single function, facilitating maintenance. I'm going to leave it to a follow-up PR, though, to keep this PR exclusively about speculative decoding.
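As a sketch of the unified helper being discussed, both branches could collapse into a single function. Assuming PyTorch; `select_tokens` is a hypothetical name, not a transformers API:

```python
import torch
from torch.distributions import Categorical

def select_tokens(new_logits: torch.Tensor, do_sample: bool, temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical unified token-selection helper (illustrative only)."""
    if do_sample:
        # Categorical over the last dim replaces the softmax + torch.multinomial pair.
        return Categorical(logits=new_logits / temperature).sample()
    return new_logits.argmax(dim=-1)

logits = torch.tensor([[[0.1, 5.0, 0.2]]])
greedy = select_tokens(logits, do_sample=False)   # shape (1, 1)
sampled = select_tokens(logits, do_sample=True)   # shape (1, 1)
```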
```python
else:
    selected_tokens = new_logits.argmax(dim=-1)
if do_sample:
    probs = new_logits.softmax(dim=-1)
```
is this case still relevant? Not sure it's a good idea to have two "assisted decoding" do_sample=True cases in our generate. Should we maybe just deprecate this case?
Super cool addition!
Not really related to this PR, but I feel like we should start putting all the generation submethods (assisted decoding, greedy & sample (guess we can merge these two), beam search, ...) into their own files by now
My only important comment here is that I don't think it's great that we have 2 assisted generation cases now where do_sample=True. Can we deprecate the "non-official" one?
@patrickvonplaten the two types of sampling are needed :D New candidate-based methods are popping up (e.g. #27775), and they don't necessarily have logits. As such, speculative decoding, which needs the candidates' logits, can't be applied to those methods.
But shouldn't they just be their "own" method now? I.e. I don't think we should put #27775 into the speculative decoding method, no?
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@patrickvonplaten #27775 does not introduce changes to assisted generation 🤗 In #28030 I've abstracted the candidate generation part of assisted generation. We now load candidate generators the same way as we load the logits processors (src/transformers/generation/utils.py, lines 899 to 919 at e6dcf8a). In assisted generation, we call the candidate generator to get candidate sequences, which may or may not contain associated logits, depending on the method (src/transformers/generation/utils.py, line 4588 at e6dcf8a). The technique in #27775 can thus be added by defining a new candidate generator. Because needing the logits (for speculative decoding) is a very limiting constraint, I'd rather keep the two sampling paths.
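A minimal sketch of the abstraction described above. The class and function names (`CandidateGenerator`, `get_candidate_generator`, etc.) mirror the description but are illustrative stand-ins, not the actual transformers code:

```python
from typing import List, Optional, Tuple

class CandidateGenerator:
    """Illustrative base class (not the transformers API): yields candidate ids,
    optionally together with the logits that produced them."""

    def get_candidates(self, input_ids: List[int]) -> Tuple[List[int], Optional[List[List[float]]]]:
        raise NotImplementedError

class AssistantModelCandidates(CandidateGenerator):
    """Model-based candidates: logits are available, so speculative decoding can apply."""

    def get_candidates(self, input_ids):
        candidate_ids = input_ids + [4, 5]            # toy continuation
        candidate_logits = [[0.1, 0.9], [0.8, 0.2]]   # toy per-token logits
        return candidate_ids, candidate_logits

class PromptLookupCandidates(CandidateGenerator):
    """Lookup-style candidates (in the spirit of #27775): no logits, so only
    the plain assisted-generation path can verify them."""

    def get_candidates(self, input_ids):
        return input_ids + [4, 5], None

def get_candidate_generator(prompt_lookup: bool) -> CandidateGenerator:
    # Illustrative dispatch, analogous to how logits processors are assembled.
    return PromptLookupCandidates() if prompt_lookup else AssistantModelCandidates()
```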
@amyeroberts PR comments addressed 🤗 @patrickvonplaten Unless you strongly oppose, I'd like to keep the two sampling paths, for the reasons I've written here -- I think it will be beneficial in the long run! :) (otherwise, a whole new generation method would have to be written for #27775)
@amyeroberts -- @patrickvonplaten and I had a chat about whether to keep the two sampling paths or not. For context, here's what we agreed on:
amyeroberts left a comment
Thanks for iterating!
@jmamou speculative decoding with …
@gante In the current implementation (4.38), … Is it intentional? If that's a bug, I can open a PR to fix it.
Not sure if this is a good idea
This is a good point! A PR to revert to the previous behaviour (with a test) would be appreciated 🙏
What does this PR do?
Useful context:
In a recent PR (#27750), the candidate generation in assisted generation got abstracted, so we can host new candidate generation techniques (such as #27722).
This PR:
- Reworks assisted candidate generation to call `.generate()`, instead of having its own custom generation loop. For most models this is nothing more than a nice abstraction. However, for models with a custom `generate()` function, this means the assistant model will now make use of it! (🤔 does this mean that DistilWhisper gets better numbers with this refactor?) Edit: moved to "Generate: assisted decoding now uses `generate` for the assistant" (#28030)

The following tests were run locally and are passing:
- `RUN_SLOW=1 py.test tests/models/whisper/ -k speculative`
- `py.test tests/ -k test_assisted` (which now triggers speculative decoding)

TODO:
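For context on what the tests above exercise, the core accept/resample rule of speculative decoding (per the original algorithm the PR follows) can be sketched as below. Assuming PyTorch; `speculative_accept` is an illustrative name, not the transformers implementation:

```python
import torch

def speculative_accept(p, q, candidates):
    """Toy sketch of the speculative decoding accept/resample rule.

    p: target-model probabilities, shape (num_candidates, vocab_size)
    q: draft-model probabilities,  shape (num_candidates, vocab_size)
    candidates: draft token ids,   shape (num_candidates,)
    Returns (number of accepted tokens, replacement token or None).
    """
    n_accepted = 0
    for i, tok in enumerate(candidates.tolist()):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if torch.rand(()) < torch.clamp(p[i, tok] / q[i, tok], max=1.0):
            n_accepted += 1
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized.
            residual = torch.clamp(p[i] - q[i], min=0.0)
            fixup = torch.multinomial(residual / residual.sum(), num_samples=1)
            return n_accepted, fixup.item()
    return n_accepted, None  # every candidate token was accepted
```

When draft and target distributions agree exactly, every draft token is accepted; when the target assigns zero probability to a draft token, it is always rejected and replaced from the residual distribution, which preserves the target model's output distribution overall.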