Support for a new Granite-Speech-Plus model #45695
Conversation
- …eech_plus.py: From review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- …eech_plus.py: From a review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- …eech_plus.py: From a review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- …eech_plus.py: From a review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- …eech_plus.py: From a review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- …eech_plus.py: From a review by eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- Remove unused imports; add docs for params; move the encoder's post_init into the general post_init, otherwise it doesn't work
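For context, `post_init()` in transformers is normally called once at the end of a `PreTrainedModel` subclass's `__init__` to run weight initialization and final processing over all submodules. A minimal sketch of the pattern that commit message describes (the class body and config placeholder are illustrative, not the PR's actual code):

```python
from transformers import PretrainedConfig, PreTrainedModel

class GraniteSpeechPlusForConditionalGeneration(PreTrainedModel):  # illustrative skeleton
    config_class = PretrainedConfig  # placeholder; the real model defines its own config class

    def __init__(self, config):
        super().__init__(config)
        # ... build encoder, projector, and language model here ...
        # Calling post_init() once on the top-level model lets weight init and
        # final processing cover every submodule; calling it inside a
        # submodule's __init__ would run before the full model is assembled.
        self.post_init()
```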
```python
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTest as _GraniteSpeechModelTestBase,
)
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTester as _GraniteSpeechModelTesterBase,
)
```
Curious to get your opinion on this, @ArthurZucker.
Yep, it's fine; we used to have inheritance before the LLMTester.
```md
— is inherited unchanged from Granite Speech. See the [Granite Speech documentation](./granite_speech) for usage examples; the same [`GraniteSpeechProcessor`] and [`GraniteSpeechFeatureExtractor`] are used here.

## GraniteSpeechPlusConfig
```
It's absolutely necessary to add a usage section, like for Granite Speech.
```python
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|> can you transcribe the speech into a written format?",
    },
]
```
Let's update the Hub repo chat template so that:
- we have a default system prompt
- we don't have to insert `<|audio|>` manually

See this example.
I'd prefer not to change this at this point because it would require large changes to the code, tests, and docs. It would be better to do this later.
```python
cat_hidden_layers (`list[int]`, *optional*):
    Indices of encoder conformer layers whose outputs are concatenated with the final encoder
    output (along the feature dimension) before being passed to the projector. When set, the
    projector's ``encoder_hidden_size`` must equal
    ``encoder_config.hidden_dim * (len(cat_hidden_layers) + 1)``.
```
The default is `None`, i.e. no extra hidden layers are added.
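To make the shape constraint concrete, here is a minimal sketch of the concatenation the docstring describes (tensor shapes and the layer count are made up for illustration; this is not the model's actual code):

```python
import torch

hidden_dim, num_frames = 256, 100
cat_hidden_layers = [3, 7]  # example indices of conformer layers to concatenate

# Pretend outputs of 12 conformer layers, each (batch, frames, hidden_dim).
layer_outputs = [torch.randn(1, num_frames, hidden_dim) for _ in range(12)]
final_output = layer_outputs[-1]

# Concatenate the selected layer outputs with the final encoder output
# along the feature dimension before the projector.
selected = [layer_outputs[i] for i in cat_hidden_layers]
projector_input = torch.cat(selected + [final_output], dim=-1)

# The projector therefore needs
# encoder_hidden_size == hidden_dim * (len(cat_hidden_layers) + 1).
assert projector_input.shape[-1] == hidden_dim * (len(cat_hidden_layers) + 1)
```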
```python
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
prompt_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
```
We should be able to call `processor.apply_chat_template` directly.
- Doc change suggestion from eustlb. Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run-slow: granite_speech_plus

This comment contains models: ["models/granite_speech_plus"]
CI Results
Commit Info

Model CI Report: ❌ 2 new failed tests from this PR 😭

[For maintainers] Suggested jobs to run (before merge): run-slow: auto, granite_speech_plus
eustlb left a comment:
Ran the slow tests on the runners manually since the model is not released yet; all clear ✅
```python
chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
prompt_text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
output_text = processor.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)
return output_text
```
This should be updated to the following in a follow-up PR!
Suggested change (replacing the snippet above):

```python
conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio.numpy()},
        {"type": "text", "text": prompt},
    ]},
]
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    **extra,
).to(device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
return processor.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)
```
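This suggested version passes the audio inline in the conversation, so `apply_chat_template` both inserts the audio placeholder and runs feature extraction in one call; the manual `<|audio|>` prompt text and the separate `processor(...)` call are no longer needed.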
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
New Granite-Speech-Plus model.
This replaces #44408 and #45512, following the review and suggestions of @eustlb.
Changes:
Design choices:
Before submitting
- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
audio models: @eustlb @ebezzam @vasqu