Add option to export encoder hidden states for Granite-speech by zvik · Pull Request #44408 · huggingface/transformers

zvik · 2026-03-03T07:50:39Z

What does this PR do?

This PR lets the Granite-speech model pass hidden states from selected intermediate encoder layers to the projector, together with the final encoder output.

This is an internal model option that is required for the next generation of Granite-speech models.

Changes:

New config parameter: encoder_hidden_layers: list[int] | None # e.g., [6, 12, 18]
Validation for the parameter values
Concatenation of the required hidden layers to the final output of the encoder
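
The validation and size bookkeeping listed above could look roughly like the sketch below. It is not the code from this PR: the helper names `validate_encoder_hidden_layers` and `projector_input_dim`, the 0-based index convention, the uniqueness rule, and the sizes (24 layers, hidden size 1024) are all assumptions made for illustration.

```python
from typing import Optional


def validate_encoder_hidden_layers(layers: Optional[list[int]], num_layers: int) -> list[int]:
    """Return the selected layer indices, or raise ValueError.

    Hypothetical helper: 0-based indices below num_layers and
    uniqueness are assumed rules, not taken from the PR.
    """
    if layers is None:
        return []
    for idx in layers:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(idx, bool) or not isinstance(idx, int):
            raise ValueError(f"layer index must be an int, got {idx!r}")
        if not 0 <= idx < num_layers:
            raise ValueError(f"layer index {idx} is outside [0, {num_layers})")
    if len(set(layers)) != len(layers):
        raise ValueError("layer indices must be unique")
    return list(layers)


def projector_input_dim(hidden_dim: int, layers: list[int]) -> int:
    """Feature size after concatenating the selected layers with the final output."""
    return hidden_dim * (len(layers) + 1)


selected = validate_encoder_hidden_layers([6, 12, 18], num_layers=24)
print(projector_input_dim(1024, selected))  # 4096
```

Because the selected layers are concatenated to the final encoder output along the feature dimension, the projector input grows by one hidden size per selected layer.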

We chose to add a new parameter instead of using the output_hidden_states option for the following reasons:

The hidden layers from the encoder are only used internally and are not needed outside the model.
The output_hidden_states flag returns all the layers instead of just the ones that are needed.
We only need a small number of hidden layers. Returning all of them with output_hidden_states increases the memory footprint, especially for large audio inputs, which can reduce the usable decoding batch size.
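
To give a feel for the memory argument, here is a back-of-the-envelope estimate. Every number in it (batch size, frame count, hidden size, layer count, float32 storage) is a made-up assumption and not a measurement of any Granite-speech model. It only counts the activations that are kept alive to be returned.

```python
def hidden_state_bytes(batch: int, frames: int, hidden_dim: int,
                       kept_layers: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to keep `kept_layers` tensors of shape (batch, frames, hidden_dim)."""
    return batch * frames * hidden_dim * bytes_per_value * kept_layers


# Hypothetical sizes, chosen only to illustrate the ratio.
all_layers = hidden_state_bytes(batch=8, frames=1500, hidden_dim=1024, kept_layers=24)
selected = hidden_state_bytes(batch=8, frames=1500, hidden_dim=1024, kept_layers=3)

print(f"all layers: {all_layers / 2**20:.1f} MiB")
print(f"3 selected: {selected / 2**20:.1f} MiB")
print(f"ratio:      {all_layers // selected}x")
```

Under these assumptions the kept activations scale linearly with the number of retained layers, so keeping 3 layers instead of 24 cuts that part of the footprint by a factor of 8.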

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

audio models: @eustlb @ebezzam @vasqu

(@gsaon @avihu111 )

eustlb

Hey @zvik, thanks a lot for raising this PR!

Interesting approach, and I do get the motivations that are totally relevant. We might want to go for a less custom, more standardised approach here that would leverage the @capture_outputs flag, cc @ArthurZucker

Also, if this feature isn't intended for this specific model, we'll refrain from merging it here. We'd rather have it directly with the new granite-speech models, as this would better align with the lib's philosophy. You can definitely use this branch to prototype with it, though! Also, we'd be glad to help with day-0 integration of the new models, I'll send a message in our common slack channel to discuss it 🤗

eustlb

I am guessing that the layers we'd want to pass to the projector will be fixed with the model to be released. I'd rather have it therefore hardcoded directly like this, but the same can also be done using a config parameter if it is necessary

eustlb · 2026-03-03T12:22:34Z

    _can_record_outputs = {
        "hidden_states": GraniteSpeechConformerBlock,
        "attentions": GraniteSpeechConformerAttention,
    }


Suggested change

    _can_record_outputs = {
        "hidden_states": [
            OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.0"),
            OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1"),
            OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.2"),
        ],
        "attentions": GraniteSpeechConformerAttention,
    }

Thanks @eustlb , I'll check this method and see if it can be used for what we need.

@zvik, are you in our colab slack channel ?

No, can you send a link to join? @eustlb

@eustlb I looked into your suggested direction. However, there seems to be a problem with the implementation of capture_outputs. If, for example, you use OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1"), it will match layers.1 but also layers.10, layers.11, layers.12, and so on.
As far as I can see, the problem is with the condition at

transformers/src/transformers/utils/output_capturing.py, line 150 in f60c4e9:

    if specs.layer_name is not None and specs.layer_name not in module_name:

and I can think of a simple method to bypass this.
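
The over-matching described in this comment can be reproduced with plain strings. The sketch below only mimics the substring test from the quoted condition; the module names and the 20-layer count are hypothetical, and `substring_matches` is an illustrative helper, not part of transformers.

```python
# Hypothetical module names, as a 20-layer encoder might expose them.
module_names = [f"encoder.layers.{i}" for i in range(20)]


def substring_matches(layer_name: str, names: list[str]) -> list[str]:
    """Names that pass the substring test `layer_name in module_name`."""
    return [name for name in names if layer_name in name]


matches = substring_matches("layers.1", module_names)
print(len(matches))  # 11
print(matches[:3])   # ['encoder.layers.1', 'encoder.layers.10', 'encoder.layers.11']
```

With a substring test, "layers.1" is a prefix of every name from layers.10 to layers.19, so one recorder spec captures eleven layers instead of one.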

OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$") something like this + use regex?

@ArthurZucker As far as I can see, regexes are not supported:

transformers/src/transformers/utils/output_capturing.py, line 145 in 27fbb51:

    if specs.layer_name is not None and specs.layer_name not in module_name:

Should be fairly easy to add no? Seems like a good feature

Resolve conflict in configuration_granite_speech.py: adopt main's @strict/@auto_docstring dataclass style while preserving encoder_hidden_layers field and its validation logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-12T10:56:02Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: granite_speech

zvik · 2026-04-12T11:07:20Z

Hi @eustlb
I have now uploaded a model that can be used for testing: konszvi/granite-4.0-1b-speech-rt-v0.09.00-merged
I've also merged recent changes from main.
Please let me know how to proceed with this PR. If you wish, we can continue on slack.
Thanks

ArthurZucker · 2026-04-14T13:25:28Z

    _can_record_outputs = {
        "hidden_states": GraniteSpeechConformerBlock,
        "attentions": GraniteSpeechConformerAttention,
    }


OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$") something like this + use regex?

ArthurZucker · 2026-04-14T13:26:20Z

+        if len(exported_hidden_states) > 0:
+            hidden_states = torch.cat(exported_hidden_states + [hidden_states], dim=-1)


you don't need to do all this. output recording should return whatever you need

As discussed above, output recording returns too many layers

eustlb

Opened #45512 to be able to do

OutputRecorder(GraniteSpeechConformerBlock, layer_name=r"layers\.(6|12|18)$")

@zvik now we'd want to close this PR and open a specific one ;)
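
For reference, the patterns discussed in this thread behave as follows under Python's standard `re.search`. This is a standalone sketch on hypothetical module names (20 layers is an assumption), using a made-up helper `regex_matches`; it does not call OutputRecorder or reproduce the change in #45512.

```python
import re

# Hypothetical module names, as a 20-layer encoder might expose them.
module_names = [f"encoder.layers.{i}" for i in range(20)]


def regex_matches(pattern: str, names: list[str]) -> list[str]:
    """Names for which re.search finds the pattern anywhere in the string."""
    return [name for name in names if re.search(pattern, name)]


# The `$` anchor stops "layers.1" from also matching layers.10 to layers.19.
print(regex_matches(r"layers\.1$", module_names))
# ['encoder.layers.1']

# An alternation selects exactly the layers of interest.
print(regex_matches(r"layers\.(6|12|18)$", module_names))
# ['encoder.layers.6', 'encoder.layers.12', 'encoder.layers.18']
```

Escaping the dot keeps it literal, and the trailing `$` requires the layer index to end the module name, which is what makes the match exact.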

zvik and others added 12 commits March 1, 2026 16:04

Support for hidden layers from the encoder

86c0a86

New configuration option `encoder_hidden_layers` allows passing hidden layers from the encoder to the projector.

Fix a typo

a9dcfce

Fix encoder call

2009114

Call was failing when output_hidden_states was set in kwargs (failed unit tests)

Fix formatting with ruff

718873e

Verify correct projector size

663aa73

Add a test to verify that the size of the encoder output times the number of concatenated layers matches the size of the projector input.

Keep only useful hidden states

cfcc551

Remove unused code

1e24fd4

Merge branch 'huggingface:main' into zk_granitespeech_enc_hidden_states

4398b47

Change import

84e63da

Change import for fail check in PR huggingface#44408

Organize imports

178eeb1

Failed ruff check for PR huggingface#44408

Fix imports with ruff

d4c4efe

Files reformatted with ruff

f11ac6a

eustlb reviewed Mar 3, 2026

ArthurZucker reviewed Apr 14, 2026

eustlb mentioned this pull request Apr 19, 2026

[OutputRecorder] re.search on layer_name #45512

Open

eustlb reviewed Apr 19, 2026

zvik mentioned this pull request Apr 29, 2026

Support for a new Granite-Speech-Plus model #45695

Merged

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open
