[Model] Add PP-OCRV5_mobile_det Model Support #43247
vasqu merged 55 commits into huggingface:main from
Conversation
yonigozlan
left a comment
Hello @XingweiDeng, thanks for opening this PR! As I said to @liu-jiaxuan in the PP-OCRV5_mobile_rec PR, there is quite a bit to change here to fit the standards of the Transformers library.
The biggest issue is that you've written everything from scratch without inheriting from existing models. The modular file should maximize inheritance. Even if this is a novel architecture (especially the Conv modules part, which might not exist elsewhere in the library), components like MLP blocks, attention, and layer norms should use standard library patterns by inheriting from an existing model's module in modular.
The novel modules that can't be inherited through modular should also follow library standards in terms of naming, formatting, structure, and good practices (a "PPOCRV5MobileDet" prefix for all module names, weight names standardized with other similar modules in the library, no single-letter variables, type hints, docstrings when args are not standard or obvious, etc.). The model should also support as many Transformers features as possible, such as the attention interface, through flags on PreTrainedModel (_supports_attention_backend, _supports_sdpa, _supports_flash_attn, etc.).
Some other big things wrong or missing:
- We shouldn't have a cv2 dependency in image processors: the "slow" processor should use PIL/NumPy functions, the fast one torch/torchvision.
- Weight initialization shouldn't be scattered across individual module constructors but centralized in _init_weights() on the PreTrainedModel class, using the Transformers "init" module (see the sketch below).
- Attention modules are standardized across models in the Transformers library, so using modular for attention modules is a must.
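For reference, here is a minimal sketch of the two patterns from the list above (class names and the `initializer_range` config field are hypothetical placeholders, not the final API):

```python
# Hedged sketch only: class/config names are placeholders to illustrate the convention.
import torch.nn as nn
from transformers import PreTrainedModel


class PPOCRV5MobileDetPreTrainedModel(PreTrainedModel):
    base_model_prefix = "model"
    # Flags that opt the model into the shared attention interface:
    _supports_sdpa = True
    _supports_flash_attn = True
    _supports_attention_backend = True

    def _init_weights(self, module):
        # All weight initialization is centralized here, not in each module's __init__.
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
```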
Before we go deeper into reviewing this new model addition (and the other PaddlePaddle ones opened recently that are very similar), please have a good look at how other models are implemented in the library. Notably, you can have a look at the recently merged PP-DocLayoutV3 PR (here's its modular file).
We also have resources to learn more about how to contribute a new model and how to use modular: Contributing a new model, using modular.
Also, as the multiple PaddlePaddle models that currently have a new model addition PR open seem to be quite similar, I'd recommend focusing on one (the simplest) for now; then we'll be able to leverage modular to easily add the other models.
Happy to answer any questions you may have!
|
Hello @yonigozlan , thank you very much for your detailed review and valuable suggestions! |
|
Hello @yonigozlan , We have revised the four models (pp_ocrv5_mobile_det, pp_ocrv5_server_det, pp_lcnet, uvdoc) to address the issues you mentioned. |
|
Hi @XingweiDeng ! Thanks a lot for iterating! I'll have a look in the coming days |
|
Hi @yonigozlan, We have made further updates based on the previous refinements to make the code more compliant with the official Transformers library standards. Please feel free to share any comments or suggestions you may have — we will promptly make revisions and follow up accordingly. |
yonigozlan
left a comment
Hello @XingweiDeng! Thanks a lot for iterating on this. This looks much better!!
Main comment is that you seem to have quite an outdated branch here, so the first thing to do would be to rebase or merge with main. Then update the model to the latest standards as indicated in my review, and once that's done we shouldn't be too far from ready to merge!
Also, let's add a test file for the image processor(s).
```python
self.interpolate_mode = interpolate_mode

# ---- Head ----
self.k = k
```
Let's use a better name for this; also, is it used anywhere?
```python
def process(
    logit: np.ndarray,
    size: np.ndarray,
    threshold: float,
    box_thresh: float,
    unclip_ratio: float,
    min_size: int,
    max_candidates: int,
) -> tuple[Union[list[np.ndarray], np.ndarray], list[float]]:
    """
    Main post-processing function to convert model predictions into text boxes.

    Args:
        logit (torch.Tensor): Model output of shape (1, H, W).
        size (torch.Tensor): Original image size (height, width).
        threshold (float): Threshold for binarizing the prediction map.
        box_thresh (float): Score threshold for filtering boxes.
        unclip_ratio (float): Expansion ratio for unclipping.
        min_size (int): Minimum side length of valid boxes.
        max_candidates (int): Maximum number of boxes to extract.

    Returns:
        tuple:
            - boxes (list or np.ndarray): Extracted text boxes.
            - scores (list): Corresponding confidence scores.
    """
    src_height, src_width = size
    mask = logit > threshold
    boxes, scores = boxes_from_bitmap(logit, mask, src_width, src_height, box_thresh, unclip_ratio, min_size, max_candidates)
    return boxes, scores
```
I don't think we need a separate method for this, let's unroll it directly in the post_process method
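Something along these lines, i.e. a sketch that only reuses the names from the quoted code above:

```python
# Sketch of unrolling `process` directly into the post-processing method;
# all names come from the quoted code, nothing new is introduced.
results = []
for logit, size in zip(outputs.logits, target_sizes):
    pred = logit[0, :, :].cpu().detach().numpy()
    src_height, src_width = size.cpu().detach().numpy()
    mask = pred > threshold
    boxes, scores = boxes_from_bitmap(
        pred, mask, src_width, src_height, box_thresh, unclip_ratio, min_size, max_candidates
    )
    results.append({"scores": scores, "boxes": boxes})
return results
```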
```python
    outputs,
    threshold: float = 0.3,
    target_sizes: Optional[Union[list[tuple[int, int]], torch.Tensor]] = None,
    box_thresh: float = 0.6,
    max_candidates: int = 1000,
    min_size: int = 3,
    unclip_ratio: float = 1.5,
):
    """
    Converts model outputs into detected text boxes.

    Args:
        preds (torch.Tensor): Model outputs.
        target_sizes (TensorType or list[tuple]): Original image sizes.
        threshold (float): Binarization threshold.
        box_thresh (float): Box score threshold.
        max_candidates (int): Maximum number of boxes.
        min_size (int): Minimum box size.
        unclip_ratio (float): Expansion ratio.

    Returns:
        list[dict]: List of detection results.
    """

    results = []
    for logit, size in zip(outputs.logits, target_sizes):
        box, score = process(
            logit=logit[0, :, :].cpu().detach().numpy(),
            size=size.cpu().detach().numpy(),
            threshold=threshold,
            box_thresh=box_thresh,
            unclip_ratio=unclip_ratio,
            min_size=min_size,
            max_candidates=max_candidates,
        )
        results.append({"scores": score, "boxes": box})
    return results


def get_image_size(
    self,
    image: np.ndarray,
    limit_side_len: int,
    limit_type: str,
    max_side_limit: int = 4000,
) -> tuple[dict, np.ndarray]:
    """
    Computes the target size for resizing an image while preserving aspect ratio.

    Args:
        image (torch.Tensor): Input image.
        limit_side_len (int): Maximum or minimum side length.
        limit_type (str): Resizing strategy: "max", "min", or "resize_long".
        max_side_limit (int): Maximum allowed side length.

    Returns:
        tuple:
            - SizeDict: Target size.
            - torch.Tensor: Original size.
    """
    limit_side_len = limit_side_len or self.limit_side_len
    limit_type = limit_type or self.limit_type
    height, width, _ = image.shape

    if limit_type == "max":
        if max(height, width) > limit_side_len:
            if height > width:
                ratio = float(limit_side_len) / height
            else:
                ratio = float(limit_side_len) / width
        else:
            ratio = 1.0
    elif limit_type == "min":
        if min(height, width) < limit_side_len:
            if height < width:
                ratio = float(limit_side_len) / height
            else:
                ratio = float(limit_side_len) / width
        else:
            ratio = 1.0
    elif limit_type == "resize_long":
        ratio = float(limit_side_len) / max(height, width)
    else:
        raise Exception("not support limit type, image ")
    resize_height = int(height * ratio)
    resize_width = int(width * ratio)

    if max(resize_height, resize_width) > max_side_limit:
        ratio = float(max_side_limit) / max(resize_height, resize_width)
        resize_height, resize_width = int(resize_height * ratio), int(resize_width * ratio)

    resize_height = max(int(round(resize_height / 32) * 32), 32)
    resize_width = max(int(round(resize_width / 32) * 32), 32)

    if resize_height == height and resize_width == width:
        return {"height": resize_height, "width": resize_width}, np.array([height, width])

    if int(resize_width) <= 0 or int(resize_height) <= 0:
        return None, (None, None)

    return {"height": resize_height, "width": resize_width}, np.array([height, width])
```
If the results obtained with the fast image processor are similar enough, we can completely discard the PIL-based processor and only keep the fast one.
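For instance, a quick equivalence check along these lines could settle it (the processor class names are assumptions based on this PR, and the test image is arbitrary):

```python
# Compares slow (PIL/numpy) vs fast (torch) preprocessing outputs.
# Both processor class names are hypothetical; `image` is any PIL image.
import torch
from PIL import Image

image = Image.open("sample.jpg")  # any representative test image
slow = PPOCRV5MobileDetImageProcessor()      # hypothetical slow class
fast = PPOCRV5MobileDetImageProcessorFast()  # hypothetical fast class
out_slow = slow(images=image, return_tensors="pt")
out_fast = fast(images=image, return_tensors="pt")
torch.testing.assert_close(out_slow.pixel_values, out_fast.pixel_values, atol=1e-4, rtol=1e-4)
```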
```python
limit_side_len = 960
limit_type = "max"
max_side_limit = 4000
```
Let's add these 3 attributes to the supported kwargs for this model, so that they can be provided at call time and init time.
You can have a look at other fast image processors in the library (like llava-next) to see how that should be done.
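Roughly this shape, if I remember the pattern correctly (the base class names should be double-checked against main before copying):

```python
# Sketch of the fast-image-processor kwargs pattern used by models like llava-next.
from typing import Optional

from transformers.image_processing_utils_fast import (
    BaseImageProcessorFast,
    DefaultFastImageProcessorKwargs,
)


class PPOCRV5MobileDetFastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
    limit_side_len: Optional[int]
    limit_type: Optional[str]
    max_side_limit: Optional[int]


class PPOCRV5MobileDetImageProcessorFast(BaseImageProcessorFast):
    # Defaults can still be overridden at init or call time via the kwargs class.
    limit_side_len = 960
    limit_type = "max"
    max_side_limit = 4000
    valid_kwargs = PPOCRV5MobileDetFastImageProcessorKwargs
```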
```python
def post_process_object_detection(
    self,
    outputs,
    threshold: float = 0.3,
    target_sizes: Optional[Union[list[tuple[int, int]], torch.Tensor]] = None,
    box_thresh: float = 0.6,
    max_candidates: int = 1000,
    min_size: int = 3,
    unclip_ratio: float = 1.5,
):
    """
    Converts model outputs into detected text boxes.

    Args:
        preds (torch.Tensor): Model outputs.
        threshold (float): Binarization threshold.
        target_sizes (TensorType or list[tuple]): Original image sizes.
        box_thresh (float): Box score threshold.
        max_candidates (int): Maximum number of boxes.
        min_size (int): Minimum box size.
        unclip_ratio (float): Expansion ratio.

    Returns:
        list[dict]: List of detection results.
    """

    results = []
    for logit, size in zip(outputs.logits, target_sizes):
        box, score = process(
            logit=logit[0, :, :].cpu().detach().numpy(),
            size=size.cpu().detach().numpy(),
            threshold=threshold,
            box_thresh=box_thresh,
            unclip_ratio=unclip_ratio,
            min_size=min_size,
            max_candidates=max_candidates,
        )

        results.append(
            {
                "boxes": box,
                "scores": score,
            }
        )
    return results
```
Let's put this method at the bottom of the module
```python
    and returns outputs compatible with the Transformers object detection API.
    """

    _keys_to_ignore_on_load_missing = ["num_batches_tracked"]
```
Is this key present in the original checkpoint? Otherwise we can remove it.
```python
def forward(
    self,
    pixel_values: torch.FloatTensor,
    labels: Optional[list[dict]] = None,
```
Let's remove labels if it's not used.
```python
    Returns:
        Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]: Detection output containing
        segmentation logits, last hidden state, and optional hidden states.
    """
```
Fully remove and use auto_docstring
| """ |
```python
class PPOCRV5MobileDetForObjectDetection(PPOCRV5MobileDetPreTrainedModel):
    """
    PPOCRV5 Mobile Det model for object (text) detection tasks. Wraps the core PPOCRV5MobileDetModel
    and returns outputs compatible with the Transformers object detection API.
    """

    _keys_to_ignore_on_load_missing = ["num_batches_tracked"]

    def __init__(self, config: PPOCRV5MobileDetConfig):
        """
        Initialize the PPOCRV5MobileDetForObjectDetection with the specified configuration.

        Args:
            config (PPOCRV5MobileDetConfig): Configuration object containing all model hyperparameters.
        """
        super().__init__(config)
        self.model = PPOCRV5MobileDetModel(config)
        self.post_init()

    def forward(
        self,
        pixel_values: torch.FloatTensor,
        labels: Optional[list[dict]] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]:
        """
        Forward pass of the PPOCRV5MobileDetForObjectDetection model, processing input images to generate
        text detection logits.

        Args:
            pixel_values (torch.FloatTensor): Input image tensor of shape (B, 3, H, W) (preprocessed pixel values).
            labels (list[dict], optional): Unused placeholder for training (object detection labels). Defaults to None.
            output_hidden_states (bool, optional): Whether to return all intermediate hidden states from the backbone.
                If None, uses the configuration's `output_hidden_states` value.
            return_dict (bool, optional): Whether to return a `PPOCRV5MobileDetForObjectDetectionOutput` object or a tuple.
                If None, uses the configuration's `use_return_dict` value.
            **kwargs: Additional unused keyword arguments for compatibility.

        Returns:
            Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]: Detection output containing
            segmentation logits, last hidden state, and optional hidden states.
        """
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.model(pixel_values, output_hidden_states=output_hidden_states, return_dict=return_dict)

        if not return_dict:
            output = (outputs[0],)
            if output_hidden_states:
                output += (outputs[1], outputs[2])
            else:
                output += (outputs[1],)

            return output

        return PPOCRV5MobileDetForObjectDetectionOutput(
            logits=outputs.logits,
            last_hidden_state=outputs.last_hidden_state,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
        )
```
It looks like PPOCRV5MobileDetForObjectDetection and PPOCRV5MobileDetModel actually return the same thing. Would it make sense to remove the head from PPOCRV5MobileDetModel and have it only in PPOCRV5MobileDetForObjectDetection? Could PPOCRV5MobileDetModel be of any use then (for custom heads, for example)?
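Something like this split, for illustration only (the backbone/neck/head attribute names are assumptions, not a final design):

```python
# Illustrative split: headless base model + task model owning the head.
from transformers.modeling_outputs import BaseModelOutput


class PPOCRV5MobileDetModel(PPOCRV5MobileDetPreTrainedModel):
    # Headless: returns fused features so custom heads can be built on top.
    def forward(self, pixel_values, **kwargs):
        features = self.neck(self.backbone(pixel_values))
        return BaseModelOutput(last_hidden_state=features)


class PPOCRV5MobileDetForObjectDetection(PPOCRV5MobileDetPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = PPOCRV5MobileDetModel(config)
        self.head = PPOCRV5MobileDetHead(config)  # the head lives only here
        self.post_init()

    def forward(self, pixel_values, **kwargs):
        features = self.model(pixel_values).last_hidden_state
        return PPOCRV5MobileDetForObjectDetectionOutput(logits=self.head(features))
```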
vasqu
left a comment
Great work! There are a few smaller things but nothing major imo. Let's wrap it up tomorrow then 🤗
```python
image_processor = AutoImageProcessor.from_pretrained(model_path).to(model.device)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)
inputs = image_processor(images=[image, image], return_tensors="pt")
```
```diff
- inputs = image_processor(images=[image, image], return_tensors="pt")
+ inputs = image_processor(images=[image, image], return_tensors="pt").to(model.device)
```
Oh, I see that the image processor is moved to the device here. Let's at least sync the examples to do either that or the other version.
```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_path = "PaddlePaddle/PP-OCRv5_mobile_det_safetensors"
model = AutoModelForObjectDetection.from_pretrained(model_path)
```
Let's add device_map here as well (with either version of the device movement for the processor input); see the combined sketch below.
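i.e. one consistent version combining both points (sketch; the checkpoint name and `image` come from the examples above):

```python
# Consistent example: device_map on the model, inputs moved to model.device.
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_path = "PaddlePaddle/PP-OCRv5_mobile_det_safetensors"
model = AutoModelForObjectDetection.from_pretrained(model_path, device_map="auto")
image_processor = AutoImageProcessor.from_pretrained(model_path)
# `image` is the PIL image loaded as in the example above.
inputs = image_processor(images=[image, image], return_tensors="pt").to(model.device)
```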
```python
@filter_output_hidden_states
@can_return_tuple
```
```diff
- @filter_output_hidden_states
- @can_return_tuple
+ @can_return_tuple
+ @filter_output_hidden_states
```
super nitpicky, but let's keep the same order as elsewhere
```python
        >>> feature_maps = outputs.feature_maps
        >>> list(feature_maps[-1].shape)
        ```"""
        kwargs["output_hidden_states"] = True
```
```diff
- kwargs["output_hidden_states"] = True
+ kwargs["output_hidden_states"] = True  # required to extract layers for the stages
```
```python
        return fused_feature_map


class PPOCRV5MobileDetHead(nn.Module):
```
Reopening this one - can we somehow inherit from the server head or similar?
```python
    interpolation: Optional["tvF.InterpolationMode"],
    **kwargs,
) -> BatchFeature:
    requires_backends(self, ["torch"])
```
That's on me - I think we can move this directly on top of the image processor. There is a decorator, @requires(backends=("torch",)), that should be it. I think we should do the same for the server image processor IIRC - missed that in the other PR.
Better late than never, I’ve just added it for server_det as well.
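For reference, the class-level usage looks like this (sketch; the import path for `requires` should be double-checked against main):

```python
# Moves the backend requirement onto the class via the decorator mentioned above
# (note the trailing comma so the backends argument is a tuple).
from transformers.utils.import_utils import requires


@requires(backends=("torch",))
class PPOCRV5MobileDetImageProcessorFast(BaseImageProcessorFast):
    ...
```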
```
@@ -0,0 +1,121 @@
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```
```diff
- # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+ # Copyright 2026 The HuggingFace Inc. team. All rights reserved.
```
slipped through, too many files :D
sry, I will check all files
```python
@is_flaky()
def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):
    super().test_batching_equivalence(atol=atol, rtol=rtol)
```
Oh, is this needed? I'd like to avoid it, but no problem if not.
|
@vasqu I’ve finished the changes, PTAL. You can also run the slow tests. 🤗 |
Just pushed this - there's still something to fix on our side now that this is moved, but this was missing. |
vasqu
left a comment
My last comments now! I quickly fixed something else on pp_lcnet re: modular, but it should be good now.
```python
        return fused_feature_map


class PPOCRV5MobileDetHead(nn.Module):
```
Can we inherit from PPOCRV5ServerDetSegmentationHead via modular here? The structure seems very similar.
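If the structures really do match, the modular file could shrink to plain inheritance (hypothetical; assumes the server head is importable as below):

```python
# Hypothetical modular inheritance; module path assumed from the server_det model.
from transformers.models.pp_ocrv5_server_det.modeling_pp_ocrv5_server_det import (
    PPOCRV5ServerDetSegmentationHead,
)


class PPOCRV5MobileDetHead(PPOCRV5ServerDetSegmentationHead):
    pass
```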
```python
if getattr(self.config, "output_hidden_states", False):  # get output_hidden_states from config
    kwargs["output_hidden_states"] = True
```
```diff
- if getattr(self.config, "output_hidden_states", False):  # get output_hidden_states from config
-     kwargs["output_hidden_states"] = True
```
Not needed, it is forced within the backbone either way.
Without this, test_hidden_states_output would fail, because I found that simply setting config.output_hidden_states = True in test_hidden_states_output doesn’t propagate to the backbone’s config. Do you have any suggestions for a better way to handle this?
Check out _set_subconfig_attributes in test_modeling_common - it recursively sets the attribute on all subconfigs.
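Conceptually it does something like this (illustrative only; check the real helper in test_modeling_common, whose exact signature may differ):

```python
# Illustrative recursion over subconfigs; `sub_configs` availability is assumed.
def set_attr_on_all_subconfigs(config, key, value):
    setattr(config, key, value)
    for name in getattr(config, "sub_configs", {}):
        set_attr_on_all_subconfigs(getattr(config, name), key, value)
```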
```python
@auto_docstring
@requires(backends=("torch",))
```
```diff
- # @is_flaky()
- # def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):
- #     super().test_batching_equivalence(atol=atol, rtol=rtol)
```
|
run-slow: pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det |
|
@zhang-prog I'm running the slow tests in the meantime; you can push changes after the results |
|
This comment contains models: ["models/pp_lcnet", "models/pp_lcnet_v3", "models/pp_ocrv5_mobile_det"] |
CI Results
Commit Info
Model CI Report: ❌ 1 new failed test from this PR 😭
|
|
[For maintainers] Suggested jobs to run (before merge)
run-slow: auto, pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det, pp_ocrv5_server_det |
|
@vasqu PTAL, please |
|
run-slow: pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det |
|
This comment contains models: ["models/pp_lcnet", "models/pp_lcnet_v3", "models/pp_ocrv5_mobile_det"] |
|
Some weird errors on CI; checking again, but I would merge if it turns green - I don't think it's related to you |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@vasqu Thank you so much for your thorough reviews. Your feedback has been incredibly professional and efficient, and it’s been a pleasure collaborating with you. Looking ahead, our team hopes to merge five models next week (before March 20th). These PRs will be created by other members of my team, and I will be involved to ensure the code meets your standards, aiming to make the review process as smooth as possible for you. I was wondering if you would be available to help us review them as well. Looking forward to your reply. Thanks again! 🤗 |
|
@zhang-prog @XingweiDeng Thanks a lot for all these iterations and for pulling through; it has been a pleasure from my side as well! 🫡 Also congrats, that's a lot of models in a short timespan already :) I will definitely check around, and I also raised some discussion internally so that other members of the team can help out (cc @yonigozlan @zucchini-nlp @ArthurZucker @Cyrilvallez). In any case, you can always ping me on our shared Slack channel. Let us know if there is anything else we can do 🤗 |
|
@vasqu This is great. We’ll be in touch next week once our PRs are ready for your review. |
|
Same, have a great weekend and rest up for next week 🤗 |
|
Hi @XingweiDeng and @zhang-prog Thank you for this work 🚀 ! I saw that the test file doesn't have any integration tests. Looking forward to them being added once the checkpoint is released 🙏 . |
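Something in this spirit would do (sketch; the checkpoint name is taken from the PR examples, and the expected output values still need to be recorded once the weights are up):

```python
# Sketch of a slow integration test; fill in expected slices against the
# released checkpoint. `self.image` is a hypothetical fixture.
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from transformers.testing_utils import slow


@slow
def test_inference(self):
    model_path = "PaddlePaddle/PP-OCRv5_mobile_det_safetensors"
    model = AutoModelForObjectDetection.from_pretrained(model_path)
    processor = AutoImageProcessor.from_pretrained(model_path)
    inputs = processor(images=[self.image], return_tensors="pt")
    outputs = model(**inputs)
    self.assertEqual(outputs.logits.shape[0], 1)  # add expected value checks once released
```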
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.