[Model] Add SLANet Model Support #45532
Conversation
vasqu
left a comment
Some initial comments from my side; it already looks pretty good! It's mostly smaller details and a new way we cache images for our tests.
| <hfoption id="AutoModel"> | ||
| ```py | ||
| import requests |
Might have missed these in the last models, but it would be nice if we could use httpx instead - we cleaned our code base to use httpx internally, so it would be nice to have it in the examples as well.
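E.g. something like this (just a sketch of the swap, reusing the demo image URL from the test further down):

```py
from io import BytesIO

import httpx
from PIL import Image

url = "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/table_recognition.jpg"
image = Image.open(BytesIO(httpx.get(url).content))
```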
Neat, already used the auto mappings :D
| out_channels: int = 50 | ||
| hidden_size: int = 256 | ||
| max_text_length: int = 500 |
| out_channels: int = 50 | |
| hidden_size: int = 256 | |
| max_text_length: int = 500 | |
| hidden_size: int = 256 |
These should be inherited instead; they have the same values if I see it correctly.
| hidden_act: str = "hardswish" | ||
| csp_kernel_size: int = 5 | ||
| csp_blocks_num: int = 1 |
| csp_blocks_num: int = 1 | |
| csp_num_blocks: int = 1 |
nit: naming to be consistent with the layer kwarg
| csp_kernel_size (`int`, *optional*, defaults to 5): | ||
| The kernel size of the CSP layer. | ||
| csp_blocks_num (`int`, *optional*, defaults to 1): | ||
| Number of the CSP layer. |
I think in the docs we should at least also spell out the full name of what CSP stands for.
| ) | ||
| ) | ||
| def forward(self, hidden_states): |
| class SLANetModel(SLANetPreTrainedModel): | ||
| def __init__(self, config: SLANetConfig): | ||
| super().__init__(config) | ||
| self.backbone = load_backbone(config) | ||
| self.neck = SLANetCSPPAN(self.backbone.num_features[2:], config) | ||
| self.post_init() | ||
| @merge_with_config_defaults | ||
| @capture_outputs | ||
| def forward( | ||
| self, hidden_states: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | ||
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | ||
| outputs = self.backbone(hidden_states, **kwargs) | ||
| hidden_states = self.neck(outputs.feature_maps) | ||
| return BaseModelOutputWithNoAttention( | ||
| last_hidden_state=hidden_states, | ||
| hidden_states=outputs.hidden_states, | ||
| ) |
Imo, this should be SLANetBackbone instead, similar to SLANeXtBackbone. It doesn't need the same naming, but structurally it should be the same:
```py
class SLANetBackbone(SLANetPreTrainedModel):
    def __init__(self, config: SLANetConfig):
        super().__init__(config)
        self.vision_backbone = load_backbone(config)
        self.post_csp_pan = SLANetCSPPAN(self.vision_backbone.num_features[2:], config)
        self.post_init()

    @can_return_tuple
    @auto_docstring
    def forward(
        self, hidden_states: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs]
    ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput:
        outputs = self.vision_backbone(hidden_states, **kwargs)
        hidden_states = self.post_csp_pan(outputs.feature_maps)
        return BaseModelOutputWithNoAttention(
            last_hidden_state=hidden_states,
            hidden_states=outputs.hidden_states,
        )
```
Just as an example
| class SLANetForTableRecognition(SLANetPreTrainedModel): | ||
| _keys_to_ignore_on_load_missing = ["num_batches_tracked"] | ||
| def __init__(self, config: SLANetConfig): | ||
| super().__init__(config) | ||
| self.model = SLANetModel(config=config) | ||
| self.head = SLANetSLAHead(config=config) | ||
| self.post_init() | ||
| @can_return_tuple | ||
| @auto_docstring | ||
| def forward( | ||
| self, pixel_values: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | ||
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | ||
| outputs = self.model(pixel_values, **kwargs) | ||
| head_outputs = self.head(outputs.last_hidden_state, **kwargs) | ||
| return SLANetForTableRecognitionOutput( | ||
| last_hidden_state=head_outputs.last_hidden_state, | ||
| hidden_states=outputs.hidden_states, | ||
| head_hidden_states=head_outputs.hidden_states, | ||
| head_attentions=head_outputs.attentions, | ||
| ) |
| class SLANetForTableRecognition(SLANetPreTrainedModel): | |
| _keys_to_ignore_on_load_missing = ["num_batches_tracked"] | |
| def __init__(self, config: SLANetConfig): | |
| super().__init__(config) | |
| self.model = SLANetModel(config=config) | |
| self.head = SLANetSLAHead(config=config) | |
| self.post_init() | |
| @can_return_tuple | |
| @auto_docstring | |
| def forward( | |
| self, pixel_values: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | |
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | |
| outputs = self.model(pixel_values, **kwargs) | |
| head_outputs = self.head(outputs.last_hidden_state, **kwargs) | |
| return SLANetForTableRecognitionOutput( | |
| last_hidden_state=head_outputs.last_hidden_state, | |
| hidden_states=outputs.hidden_states, | |
| head_hidden_states=head_outputs.hidden_states, | |
| head_attentions=head_outputs.attentions, | |
| ) | |
| class SLANetForTableRecognition(SLANeXtForTableRecognition): | |
| _keys_to_ignore_on_load_missing = ["num_batches_tracked"] | |
| @can_return_tuple | |
| @auto_docstring | |
| def forward( | |
| self, pixel_values: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | |
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | |
| outputs = self.model(pixel_values, **kwargs) | |
| head_outputs = self.head(outputs.last_hidden_state, **kwargs) | |
| # Key difference: no attentions in its vision model | |
| return SLANetForTableRecognitionOutput( | |
| last_hidden_state=head_outputs.last_hidden_state, | |
| hidden_states=outputs.hidden_states, | |
| head_hidden_states=head_outputs.hidden_states, | |
| head_attentions=head_outputs.attentions, | |
| ) |
Then we also don't override base_model_prefix (i.e. base_model_prefix = "backbone") and use the same naming.
| expected_arg_names = ["pixel_values"] | ||
| self.assertListEqual(arg_names[:1], expected_arg_names) | ||
| def test_hidden_states_output(self): |
Let's add a small docstring explaining why we override.
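Something along these lines (placeholder wording only; the actual reason should be filled in):

```py
def test_hidden_states_output(self):
    """Overridden from the common tests: <short note on the SLANet-specific reason>."""
    ...
```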
| url = "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/table_recognition.jpg" | ||
| self.image = Image.open(requests.get(url, stream=True).raw) |
We changed the logic a bit to cache more, can you try to refactor:
- Add the image to the list of images
- Then use `url_to_local_path` to get the appropriate (cached) image path

See `transformers/tests/models/gemma4/test_modeling_gemma4.py`, lines 431 to 433 in ef97a75.

Tbh, might be better to do this for all PP models that have been added (in a separate PR).
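Roughly something like this, as a sketch (assuming `url_to_local_path` is the cached-download test helper referenced above; see the gemma4 test file for the actual pattern and import):

```py
from PIL import Image

# `url_to_local_path` downloads the URL once and returns the cached local file path
url = "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/table_recognition.jpg"
self.image = Image.open(url_to_local_path(url))
```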
Done. If this change looks good to you, I’ll open a new PR to update the previous models as well.
vasqu
left a comment
Looks super good! Just some last nits, but nothing big
Re: the image caching, yes, it would be nice if you could open another PR for that for the other models 🤗
I think we need to add the slanext image processor to the auto mappings in image_processing_auto (under MISSING_IMAGE_PROCESSOR_MAPPING_NAMES).
Seems like the auto mappings didn't pick it up.
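Roughly (just a sketch; the exact entry format and the processor class name should follow the neighbouring entries in that file):

```py
# In image_processing_auto.py (sketch only; entry format and class name assumed)
from collections import OrderedDict

MISSING_IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        # ... existing entries ...
        ("slanext", "SLANextImageProcessor"),
    ]
)
```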
| csp_kernel_size (`int`, *optional*, defaults to 5): | ||
| The kernel size of the Cross Stage Partial (CSP) layer. | ||
| csp_num_blocks (`int`, *optional*, defaults to 1): | ||
| Number of the Cross Stage Partial (CSP) layer. |
| Number of the Cross Stage Partial (CSP) layer. | |
| Number of blocks within the Cross Stage Partial (CSP) layer. |
| def __init__(self, config: SLANetConfig): | ||
| super().__init__(config) | ||
| self.vision_backbone = load_backbone(config) | ||
| self.post_csp_pan = SLANetCSPPAN(self.vision_backbone.num_features[2:], config) |
Super nit: the order is a bit weird; I would change it to have the config as the 1st arg.
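E.g. (assuming SLANetCSPPAN's own signature is flipped to match):

```py
self.post_csp_pan = SLANetCSPPAN(config, self.vision_backbone.num_features[2:])
```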
| @merge_with_config_defaults | ||
| @capture_outputs | ||
| def forward( | ||
| self, hidden_states: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | ||
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | ||
| outputs = self.vision_backbone(hidden_states, **kwargs) | ||
| hidden_states = self.post_csp_pan(outputs.feature_maps) | ||
| return BaseModelOutputWithNoAttention( | ||
| last_hidden_state=hidden_states, | ||
| hidden_states=outputs.hidden_states, | ||
| ) |
| @merge_with_config_defaults | |
| @capture_outputs | |
| def forward( | |
| self, hidden_states: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | |
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | |
| outputs = self.vision_backbone(hidden_states, **kwargs) | |
| hidden_states = self.post_csp_pan(outputs.feature_maps) | |
| return BaseModelOutputWithNoAttention( | |
| last_hidden_state=hidden_states, | |
| hidden_states=outputs.hidden_states, | |
| ) | |
| @can_return_tuple | |
| @auto_docstring | |
| def forward( | |
| self, hidden_states: torch.FloatTensor, **kwargs: Unpack[TransformersKwargs] | |
| ) -> tuple[torch.FloatTensor] | SLANetForTableRecognitionOutput: | |
| outputs = self.vision_backbone(hidden_states, **kwargs) | |
| hidden_states = self.post_csp_pan(outputs.feature_maps) | |
| return BaseModelOutputWithNoAttention( | |
| last_hidden_state=hidden_states, | |
| hidden_states=outputs.hidden_states, | |
| ) |
I think we don't need those decorators; we rely on the backbone to collect the hidden states, and it does so by itself.
| class SLANetForTableRecognitionOutput(SLANeXtForTableRecognitionOutput): | ||
| pass |
I think we should not inherit here, but inherit from BaseModelOutputWithNoAttention instead with the same additions as SLANeXt, just to avoid confusion about attentions being in the output.
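Roughly (just a sketch; the extra field names are taken from the head outputs above, and the exact typing/docstrings should mirror the SLANeXt output):

```py
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.modeling_outputs import BaseModelOutputWithNoAttention


@dataclass
class SLANetForTableRecognitionOutput(BaseModelOutputWithNoAttention):
    # Head-specific outputs on top of the base fields; no backbone attentions
    head_hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    head_attentions: Optional[tuple[torch.FloatTensor, ...]] = None
```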
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, slanet
#45562 PTAL. 🤗
run-slow: slanet
This comment contains models: ["models/slanet"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
No description provided.