Support for a new Granite-Speech-Plus model#45695

Merged
eustlb merged 19 commits into huggingface:main from zvik:granite_speech_plus
Apr 29, 2026

Conversation

@zvik

@zvik zvik commented Apr 29, 2026

What does this PR do?

New Granite-Speech-Plus model.

This should replace #44408 and #45512 following the review and suggestions of @eustlb

Changes:

  • Add new Granite-Speech-Plus model. This is similar to the Granite-Speech model with support for the encoder to output additional internal state.
  • New configuration parameter for the encoder: cat_hidden_layers with optional list for internal layers
  • Encoder modified to output additional information
  • Model modified to validate parameters
  • Add corresponding tests
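A toy sketch (pure Python, not the real conformer) of the idea the list above describes: capture the outputs of selected internal encoder layers and concatenate them with the final encoder output along the feature dimension. The name `cat_hidden_layers` follows the PR text; everything else here is a stand-in, not the actual Granite-Speech-Plus code.

```python
def encode_with_cat(layers, features, cat_hidden_layers=None):
    """`features` is a list of per-frame feature vectors (lists of floats).
    Each layer maps a feature vector to a new feature vector. Outputs of the
    layers listed in `cat_hidden_layers` are concatenated with the final
    output along the feature dimension."""
    captured = []  # internal states selected for concatenation
    for i, layer in enumerate(layers):
        features = [layer(v) for v in features]
        if cat_hidden_layers and i in cat_hidden_layers:
            captured.append(features)
    if not captured:
        return features
    # per frame: captured states first, then the final output
    return [sum((c[t] for c in captured), []) + features[t] for t in range(len(features))]

# four toy "layers", each just scales every feature by a constant
layers = [lambda v, s=s: [x * s for x in v] for s in (1.0, 2.0, 3.0, 0.5)]
frames = [[1.0, 1.0], [2.0, 2.0]]  # 2 frames, hidden_dim = 2
out = encode_with_cat(layers, frames, cat_hidden_layers=[1, 2])
print(len(out[0]))  # feature dim = 2 * (2 captured + 1 final) = 6
```

With no `cat_hidden_layers`, the function degenerates to a plain layer stack, which matches the "default is None" behavior discussed below in the review.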

Design choices:

  • We decided not to use the output_capturing tool because it doesn't work well in this case (See [OutputRecorder] re.search on layer_name #45512)
  • Validation of the encoder's new parameter is performed in the model configuration's post_init: when I tried to add a post_init to the encoder configuration, it required calling super().__post_init__(), which caused the modular_model_converter to fail.
  • I confirm that this is not a pure code agent PR.
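The validation design choice above can be sketched as follows. This is a minimal illustration, assuming the attribute names from the PR description (`cat_hidden_layers`, `hidden_dim`, the projector's `encoder_hidden_size`); it is not the actual Granite-Speech-Plus configuration code, just the shape of a check done in the model-level post_init rather than the encoder config.

```python
class EncoderConfig:
    """Stand-in for the encoder config described in the PR (names assumed)."""
    def __init__(self, hidden_dim=256, num_layers=6, cat_hidden_layers=None):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.cat_hidden_layers = cat_hidden_layers  # None => no extra layers

class ModelConfig:
    """Validation lives here, in the *model* config's post_init, since a
    per-encoder post_init clashed with the modular converter (see above)."""
    def __init__(self, encoder_config, projector_encoder_hidden_size):
        self.encoder_config = encoder_config
        self.projector_encoder_hidden_size = projector_encoder_hidden_size
        self.__post_init__()

    def __post_init__(self):
        cat = self.encoder_config.cat_hidden_layers
        if cat is None:
            return
        if any(i < 0 or i >= self.encoder_config.num_layers for i in cat):
            raise ValueError("cat_hidden_layers indices out of range")
        expected = self.encoder_config.hidden_dim * (len(cat) + 1)
        if self.projector_encoder_hidden_size != expected:
            raise ValueError(
                f"encoder_hidden_size must be {expected}, "
                f"got {self.projector_encoder_hidden_size}"
            )

cfg = ModelConfig(EncoderConfig(hidden_dim=256, cat_hidden_layers=[2, 4]), 768)
print(cfg.projector_encoder_hidden_size)  # 768 == 256 * (2 + 1)
```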

Before submitting

eustlb and others added 14 commits April 23, 2026 18:35
…eech_plus.py


From a review by eustlb

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
- Remove unused imports
- Add docs for params
- Move the encoder post_init into the general post_init, otherwise it doesn't work
Comment on lines +25 to +30
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTest as _GraniteSpeechModelTestBase,
)
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTester as _GraniteSpeechModelTesterBase,
)
Contributor

curious to get your opinion on this @ArthurZucker

Collaborator

yep it's fine, we used to have inheritance before the LLMTester

— is inherited unchanged from Granite Speech. See the [Granite Speech documentation](./granite_speech) for usage
examples; the same [`GraniteSpeechProcessor`] and [`GraniteSpeechFeatureExtractor`] are used here.

## GraniteSpeechPlusConfig
Contributor

absolutely necessary to add a usage section, like for granite speech

Author

Added in 9a0c130

Comment on lines +199 to +208
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|> can you transcribe the speech into a written format?",
    },
]
Contributor

let's update the hub repo chat template so that

  1. we have a default system prompt
  2. we don't have to put <|audio|> manually

see this example

Author

I prefer not to change this at this point because it will require large changes in code, testing and docs.

It would be better to do this later.

Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM

Comment on lines +49 to +54
cat_hidden_layers (`list[int]`, *optional*):
    Indices of encoder conformer layers whose outputs are concatenated with the final encoder
    output (along the feature dimension) before being passed to the projector. When set, the
    projector's ``encoder_hidden_size`` must equal
    ``encoder_config.hidden_dim * (len(cat_hidden_layers) + 1)``.

Collaborator

no defaults?

Author

The default is None, meaning no hidden layers are added.
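The size constraint quoted in the docstring above is plain arithmetic, which a quick example makes concrete (the numbers here are illustrative, not taken from a released checkpoint): with `hidden_dim=1024` and three concatenated layers, the projector must expect four times the base width.

```python
# Constraint from the docstring:
#   encoder_hidden_size == hidden_dim * (len(cat_hidden_layers) + 1)
hidden_dim = 1024               # illustrative encoder feature width
cat_hidden_layers = [3, 7, 11]  # three internal layers to concatenate
encoder_hidden_size = hidden_dim * (len(cat_hidden_layers) + 1)
print(encoder_hidden_size)  # 1024 * (3 + 1) = 4096
```

With `cat_hidden_layers=None` the factor is 1 and the projector width stays at `hidden_dim`, matching the default behavior stated above.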

Comment on lines +25 to +30
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTest as _GraniteSpeechModelTestBase,
)
from ..granite_speech.test_modeling_granite_speech import (
    GraniteSpeechForConditionalGenerationModelTester as _GraniteSpeechModelTesterBase,
)
Collaborator

yep it's fine, we used to have inheritance before the LLMTester

Comment thread docs/source/en/model_doc/granite_speech_plus.md Outdated
Comment on lines +78 to +80
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
prompt_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
Contributor

we should be able to do processor.apply_chat_template directly

zvik and others added 3 commits April 29, 2026 13:29
Doc change suggestion from eustlb

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eustlb
Contributor

eustlb commented Apr 29, 2026

run-slow: granite_speech_plus

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/granite_speech_plus"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 896cf332 workflow commit (merge commit)
PR 0a046dcf branch commit (from PR)
main 1ca0be50 base commit (on main)

Model CI Report

2 new failed tests from this PR 😭

  • granite_speech_plus:
    tests/models/granite_speech_plus/test_modeling_granite_speech_plus.py::GraniteSpeechPlusForConditionalGenerationIntegrationTest::test_small_model_integration_test_batch (✅ ⟹ ❌)
    tests/models/granite_speech_plus/test_modeling_granite_speech_plus.py::GraniteSpeechPlusForConditionalGenerationIntegrationTest::test_small_model_integration_test_single (✅ ⟹ ❌)

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, granite_speech_plus

Contributor

@eustlb eustlb left a comment


Ran the slow test on the runners manually since the model is not released yet, all clear ✅

Comment on lines +70 to +77
chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
prompt_text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
output_text = processor.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)
return output_text
Contributor

@eustlb eustlb Apr 29, 2026

this should be updated to this in a follow-up PR!

Suggested change
chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
prompt_text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
output_text = processor.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)
return output_text
conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio.numpy()},
        {"type": "text", "text": prompt},
    ]},
]
extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    **extra,
).to(device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
return processor.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)

@eustlb eustlb enabled auto-merge April 29, 2026 14:22
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eustlb eustlb added this pull request to the merge queue Apr 29, 2026
Merged via the queue into huggingface:main with commit a8f43ec Apr 29, 2026
28 checks passed