Add option to export encoder hidden states for Granite-speech #44408
zvik wants to merge 13 commits into huggingface:main
Conversation
New configuration option `encoder_hidden_layer` allows passing hidden layers from the encoder to the projector.
Call was failing when `output_hidden_states` was set in kwargs (failing unit tests).
Add a test to verify that the size of the encoder output times the number of concatenated layers matches the size of the projector input.
Change import for fail check in PR huggingface#44408
Failed ruff check for PR huggingface#44408
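The commit messages above describe a new config field with validation. A minimal sketch of what such a field could look like (the class name, field names, and defaults here are assumptions for illustration, not the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class GraniteSpeechEncoderConfigSketch:
    # Hypothetical stand-in for the real encoder config class
    num_layers: int = 10
    encoder_hidden_layers: list = field(default_factory=list)

    def __post_init__(self):
        # Validation in the spirit described above: requested layer
        # indices must refer to existing encoder layers.
        for idx in self.encoder_hidden_layers:
            if not 0 <= idx < self.num_layers:
                raise ValueError(f"encoder_hidden_layers index {idx} is out of range")

cfg = GraniteSpeechEncoderConfigSketch(encoder_hidden_layers=[3, 7])
print(cfg.encoder_hidden_layers)  # [3, 7]
```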
eustlb
left a comment
Hey @zvik, thanks a lot for raising this PR!
Interesting approach, and I do get the motivation, which is totally relevant. We might want to go for a less custom, more standardised approach here that would leverage the `@capture_outputs` flag, cc @ArthurZucker
Also, if this feature isn't intended for this specific model, we'll refrain from merging it here. We'd rather have it directly with the new granite-speech models, as this would better align with the lib's philosophy. You can definitely use this branch to prototype with it, though! Also, we'd be glad to help with day-0 integration of the new models, I'll send a message in our common slack channel to discuss it 🤗
eustlb
left a comment
I am guessing that the layers we'd want to pass to the projector will be fixed for the model to be released. I'd therefore rather have them hardcoded directly like this, but the same can also be done via a config parameter if necessary.
```python
_can_record_outputs = {
    "hidden_states": GraniteSpeechConformerBlock,
    "attentions": GraniteSpeechConformerAttention,
}
```
```python
_can_record_outputs = {
    "hidden_states": [
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.0"),
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1"),
        OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.2"),
    ],
    "attentions": GraniteSpeechConformerAttention,
}
```
Thanks @eustlb , I'll check this method and see if it can be used for what we need.
@eustlb I tried looking at your suggested direction. However, there seems to be a problem with the implementation of `capture_outputs`: if, for example, you use `OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1")`, this will match `layers.1` but also `layers.10`, `layers.11`, `layers.12`, ...
As far as I can see, the problem is with the condition at
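A minimal repro of the matching issue described here (pure Python; the `layer_names` list is illustrative, not the real module tree):

```python
import re

# Illustrative module names, as produced by an encoder with 13 blocks
layer_names = [f"layers.{i}" for i in range(13)]

# Plain substring test: "layers.1" also matches layers.10, 11, 12
substring_matches = [n for n in layer_names if "layers.1" in n]
print(substring_matches)  # ['layers.1', 'layers.10', 'layers.11', 'layers.12']

# An anchored regex matches only the intended layer
regex_matches = [n for n in layer_names if re.search(r"layers\.1$", n)]
print(regex_matches)  # ['layers.1']
```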
`OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$")` something like this + use regex?
> `OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$")` something like this + use regex?
@ArthurZucker As far as I can see, regexes are not supported:
Should be fairly easy to add, no? Seems like a good feature.
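If regex support were added, the matching could look something like this (a hypothetical helper for illustration, not the actual Transformers code):

```python
import re

def layer_name_matches(layer_name: str, module_name: str) -> bool:
    """Hypothetical matcher sketch: treat layer_name as a regex so users
    can anchor it (e.g. "layers.1$"). Plain names keep roughly the old
    behavior, since a literal string is also a valid regex that matches
    as a substring."""
    return re.search(layer_name, module_name) is not None

# Anchored pattern selects exactly one layer
print(layer_name_matches(r"layers\.1$", "encoder.layers.1"))   # True
print(layer_name_matches(r"layers\.1$", "encoder.layers.10"))  # False
# Unanchored pattern keeps the old substring semantics
print(layer_name_matches("layers.1", "encoder.layers.10"))     # True
```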
Resolve conflict in configuration_granite_speech.py: adopt main's @strict/@auto_docstring dataclass style while preserving encoder_hidden_layers field and its validation logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
[For maintainers] Suggested jobs to run (before merge): run-slow: granite_speech

Hi @eustlb
```python
_can_record_outputs = {
    "hidden_states": GraniteSpeechConformerBlock,
    "attentions": GraniteSpeechConformerAttention,
}
```
> `OutputRecorder(GraniteSpeechConformerBlock, layer_name="layers.1$")` something like this + use regex?
```python
if len(exported_hidden_states) > 0:
    hidden_states = torch.cat(exported_hidden_states + [hidden_states], dim=-1)
```
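The quoted snippet concatenates the captured layer outputs with the final encoder output along the feature dimension. A dependency-free sketch of the same operation (shapes are illustrative):

```python
# Mimic torch.cat(..., dim=-1) on nested lists: per time step, append
# each captured layer's feature vector before the final encoder output.
def cat_last_dim(tensors):
    # tensors: list of [time][features] nested lists, same time length
    return [sum((t[i] for t in tensors), []) for i in range(len(tensors[0]))]

hidden_states = [[1.0, 2.0], [3.0, 4.0]]             # final output: 2 steps x 2 feats
exported_hidden_states = [[[5.0, 6.0], [7.0, 8.0]]]  # one captured layer

out = cat_last_dim(exported_hidden_states + [hidden_states])
# Feature width grows by one encoder-output width per concatenated layer
print(out)  # [[5.0, 6.0, 1.0, 2.0], [7.0, 8.0, 3.0, 4.0]]
```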
You don't need to do all this; output recording should return whatever you need.
As discussed above, output recording returns too many layers
What does this PR do?
This PR allows the Granite-speech model to use hidden states from the encoder hidden layers.
This is an internal model option that is required for the next generation of Granite-speech models.
Changes:
We chose to add a new parameter instead of using the output_hidden_states option for the following reasons:
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
(@gsaon @avihu111)