Add option to export encoder hidden states for Granite-speech #44408

Open
zvik wants to merge 13 commits into huggingface:main from zvik:zk_granitespeech_enc_hidden_states

zvik commented Mar 3, 2026

What does this PR do?

This PR allows the Granite-speech model to pass hidden states from selected intermediate encoder layers to the projector.

This is an internal model option that is required for the next generation of Granite-speech models.

Changes:

  • New config parameter: encoder_hidden_layers: list[int] | None # e.g., [6, 12, 18]
  • Validation for the parameter values
  • Concatenation of the required hidden layers to the final output of the encoder
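
As a rough illustration of the validation and concatenation items above, the sketch below checks a list of layer indices and computes the width of the encoder output after the selected layers are concatenated with the final output along the last dimension. The function names, the validation rules (unique, in-range integer indices), and the example sizes are assumptions made for illustration; the checks in the PR itself may differ.

```python
def validate_encoder_hidden_layers(layers, num_layers):
    """Hypothetical validation for `encoder_hidden_layers` (rules are assumed)."""
    if layers is None:
        return []  # feature disabled: only the final encoder output is used
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in layers):
        raise ValueError("encoder_hidden_layers must be a list of integers")
    if len(set(layers)) != len(layers):
        raise ValueError("encoder_hidden_layers must not contain duplicates")
    out_of_range = [i for i in layers if not 0 <= i < num_layers]
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    return list(layers)


def concatenated_width(hidden_dim, layers):
    """Last-dimension width after concatenating the selected layers
    with the final encoder output."""
    return hidden_dim * (len(layers) + 1)


# Example values are hypothetical.
selected = validate_encoder_hidden_layers([6, 12, 18], num_layers=24)
projector_input_dim = concatenated_width(hidden_dim=1024, layers=selected)
```

Under these assumptions the projector input size is the encoder hidden size multiplied by the number of concatenated tensors, which is the relationship the added unit test is described as checking.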

We chose to add a new parameter instead of using the output_hidden_states option for the following reasons:

  • The hidden layers from the encoder are only used internally and are not needed outside of the model.
  • The output_hidden_states flag returns every layer instead of just the ones that are needed.
  • In our case, we only need a small number of hidden layers. Returning all of them with the output_hidden_states flag can increase the memory footprint, especially for large audio data. This can impact the allowed decoding batch size.
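
To make the memory argument above concrete, here is a back-of-the-envelope sketch comparing the storage needed to keep every encoder hidden state against keeping only a few. All sizes are hypothetical, and the count of `num_layers + 1` tensors for the "all layers" case is an assumption about what gets returned; only the ratio matters.

```python
def hidden_state_bytes(batch, frames, hidden_dim, num_tensors, bytes_per_value=2):
    """Bytes needed to hold `num_tensors` hidden-state tensors of shape
    (batch, frames, hidden_dim) at once (2 bytes per value, e.g. fp16)."""
    return batch * frames * hidden_dim * num_tensors * bytes_per_value


# Hypothetical sizes, chosen only to show the ratio.
batch, frames, hidden_dim, num_layers = 8, 3000, 1024, 16
selected_layers = [4, 8, 12]

# Assumed: returning all hidden states keeps one tensor per layer plus the input.
all_layers = hidden_state_bytes(batch, frames, hidden_dim, num_layers + 1)
# Selected layers plus the final encoder output.
only_selected = hidden_state_bytes(batch, frames, hidden_dim, len(selected_layers) + 1)

print(f"all layers:      {all_layers / 2**20:.1f} MiB")
print(f"selected layers: {only_selected / 2**20:.1f} MiB")
```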

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gsaon @avihu111

zvik and others added 12 commits March 1, 2026 16:04
New configuration option `encoder_hidden_layer` allows passing hidden layers from the encoder to the projector.
Call was failing when output_hidden_states was set in kwargs (failed unit tests)
Add a test to verify that the size of the encoder output times the number of concatenated layers matches the size of the projector input.
Change import to fix failing check in PR huggingface#44408
Failed ruff check for PR huggingface#44408

Contributor

eustlb left a comment

Hey @zvik, thanks a lot for raising this PR!

Interesting approach, and I do get the motivations, which are totally relevant. We might want to go for a less custom, more standardised approach here that would leverage the @capture_outputs flag, cc @ArthurZucker

Also, if this feature isn't intended for this specific model, we'll refrain from merging it here. We'd rather have it directly with the new granite-speech models, as this would better align with the lib's philosophy. You can definitely use this branch to prototype with it, though! Also, we'd be glad to help with day-0 integration of the new models, I'll send a message in our common slack channel to discuss it 🤗

Contributor

eustlb left a comment

I am guessing that the layers we'd want to pass to the projector will be fixed for the model to be released. I'd therefore rather have them hardcoded directly like this, but the same can also be done with a config parameter if necessary.

Comment on lines 287 to 290
_can_record_outputs = {
    "hidden_states": GraniteSpeechConformerBlock,
    "attentions": GraniteSpeechConformerAttention,
}

Contributor

Suggested change
_can_record_outputs = {
    "hidden_states": [
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.0"),
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1"),
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.2"),
    ],
    "attentions": GraniteSpeechConformerAttention,
}

Author

Thanks @eustlb , I'll check this method and see if it can be used for what we need.

Contributor

@zvik, are you in our collab Slack channel?

Author

zvik commented Mar 3, 2026

No, can you send a link to join? @eustlb

Author

@eustlb I looked into the direction you suggested. However, there seems to be a problem with the implementation of capture_outputs. If, for example, you use OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1"), this will match layers.1 but also layers.10, layers.11, layers.12 ...
As far as I can see, the problem is with the condition at

if specs.layer_name is not None and specs.layer_name not in module_name:
and I can think of a simple method to bypass this.
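
The over-matching described here follows directly from a substring test. The sketch below reproduces it in plain Python and shows one possible workaround; the module names of the form `encoder.layers.N` and the `endswith` workaround are assumptions for illustration, not the library's behaviour.

```python
# Hypothetical module names for a 13-layer encoder.
module_names = [f"encoder.layers.{i}" for i in range(13)]
layer_name = "layers.1"

# Substring test, mirroring `specs.layer_name not in module_name` in the quoted condition.
substring_hits = [name for name in module_names if layer_name in name]

# One possible workaround (an assumption): require the layer name to be
# the full trailing part of the module name.
exact_hits = [
    name
    for name in module_names
    if name == layer_name or name.endswith("." + layer_name)
]

print(substring_hits)
print(exact_hits)
```

The substring test also picks up layers 10 to 12, while the suffix test keeps only layer 1.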

Collaborator

OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$") something like this + use regex?

Author

OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$") something like this + use regex?

@ArthurZucker As far as I can see, regexes are not supported:

if specs.layer_name is not None and specs.layer_name not in module_name:

Collaborator

Should be fairly easy to add, no? Seems like a good feature.

Resolve conflict in configuration_granite_speech.py: adopt main's @strict/@auto_docstring
dataclass style while preserving encoder_hidden_layers field and its validation logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: granite_speech

Author

zvik commented Apr 12, 2026

Hi @eustlb
I have now uploaded a model that can be used for testing: konszvi/granite-4.0-1b-speech-rt-v0.09.00-merged
I've also merged recent changes from main.
Please let me know how to proceed with this PR. If you wish, we can continue on Slack.
Thanks

Comment on lines +331 to +332
if len(exported_hidden_states) > 0:
    hidden_states = torch.cat(exported_hidden_states + [hidden_states], dim=-1)

Collaborator

You don't need to do all this. Output recording should return whatever you need.

Author

As discussed above, output recording returns too many layers.

Contributor

eustlb left a comment

Opened #45512 to be able to do

OutputRecorder(GraniteSpeechConformerBlock, layer_name=r"layers\.(6|12|18)$")

@zvik now we'd want to close this PR and open a specific one ;)
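
For reference, the pattern in the comment above can be checked against candidate module names with the standard `re` module. This is only a sketch: the `encoder.layers.N` names are hypothetical, and the use of `re.search` is an assumption about how such a pattern would be applied, not a description of what #45512 implements.

```python
import re

# Pattern taken from the comment above.
pattern = re.compile(r"layers\.(6|12|18)$")

# Hypothetical module names for a 24-layer encoder.
module_names = [f"encoder.layers.{i}" for i in range(24)]

# Assumed matching semantics: search anywhere in the name, anchored at the end by `$`.
matched = [name for name in module_names if pattern.search(name)]

print(matched)
```

The escaped dot and the `$` anchor are what keep `layers.16` or `layers.1` from matching, which a plain substring test cannot express.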
