[Model] Add PP-OCRV5_mobile_det Model Support by XingweiDeng · Pull Request #43247 · huggingface/transformers

XingweiDeng · 2026-01-13T08:58:19Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

yonigozlan

Hello @XingweiDeng, thanks for opening this PR!, As I said to @liu-jiaxuan in the PP-OCRV5_mobile_rec PR, there is quite a bit to change here to fit the standards of the Transformers library.

The biggest issue is that you've written everything from scratch without inheriting from existing models. The modular file should maximize inheritance. Even if this is a novel architecture (especially the Conv modules part, which might not exist elsewhere in the library), components like MLP blocks, attention, and layer norms should use standard library patterns by inheriting form an existing model's module in modular.

The novel modules that can't be inherited through modular should also follow library standards in terms of naming, formatting, structure and good-practices ("PPOCRV5MobileRec" prefix for all module names, weight names standardized with other similar modules in the library, no single letter variables, type hints, docstrings when args are not standards or obvious etc.), and the model should support as much transformers features as possible, such as the attention interface through flags in PreTrainedModel( _supports_attention_backend, _supports_sdpa, _supports_flash_attn etc.)

Some other big things wrong or missing:

We shouldn't have a cv2 dependency in image processors, "slow" should use PiL/numpy functions, fast torch/torchvision.
Weight initialization shouldn't be scattered in individual module constructors but centralized in _init_weights() on the PreTrainedModel class, and use the transfromers "init" module.
Attention modules are standardized across models in the transformers library, so using modular for attention modules is a must.

Before we go deeper in reviewing this new model addition (and other Paddle Paddle ones open recently that are very similar), please have a good look at how other models are implemented in the library. Notably, you can have a look at the recently merged PP-DocLayoutV3 PR (here's its modular file.
We also have resources to learn more about how to contribute a new model and how to use modular: Contributing a new model, using modular.

Also as the multiple Paddle Paddle models that have a new model addition PR open currently seem to be quite similar, I'd recommend focusing on one (the simplest) for now, then we'll be able to leverage modular to easily add the other models.

Happy to answer any questions you may have!

XingweiDeng · 2026-02-14T08:45:07Z

Hello @yonigozlan , thank you very much for your detailed review and valuable suggestions!
We have revised the four models (pp_ocrv5_mobile_det, pp_ocrv5_server_det, pp_lcnet, uvdoc) to address the issues you mentioned.
Notably, all four of these models are convolutional neural networks and do not involve attention modules, so we do not implement related attention components.
Regarding the image processing for pp_ocrv5_mobile_det and pp_ocrv5_server_det, we have replaced as many cv2 operations as possible with PIL and NumPy implementations.
However, we were unable to find drop-in replacements that preserve model accuracy for two specific functions: cv2.fillPoly and cv2.findContours, so we have not been able to remove these usages for now.
We will continue to refine the implementation in line with the library’s standards and conventions. Please let us know if you have any further comments or guidance.
Thank you again for your help!

XingweiDeng · 2026-02-25T07:41:49Z

Hello @yonigozlan , We have revised the four models (pp_ocrv5_mobile_det, pp_ocrv5_server_det, pp_lcnet, uvdoc) to address the issues you mentioned.

yonigozlan · 2026-02-26T16:26:51Z

Hi @XingweiDeng ! Thanks a lot for iterating! I'll have a look in the coming days

XingweiDeng · 2026-03-03T07:29:16Z

Hi @yonigozlan, We have made further updates based on the previous refinements to make the code more compliant with the official Transformers library standards. Please feel free to share any comments or suggestions you may have — we will promptly make revisions and follow up accordingly.

yonigozlan

Hello @XingweiDeng ! Thanks a lot for iterating on this, This looks much better!!
Main comment is that you seem to have quite an outdated branch here, so the first thing to do would be to rebase or merge with main. Then update the model to the latest standards as indicated in my review, and once that's done we shouldn't be too far from ready to merge!
Also let's add a test file for the image processor(s)

yonigozlan · 2026-03-04T19:16:49Z

+        self.interpolate_mode = interpolate_mode
+
+        # ---- Head ----
+        self.k = k


Let's use a better name for this, also is it used anywhere?

yonigozlan · 2026-03-04T19:18:10Z

+def process(
+    logit: np.ndarray,
+    size: np.ndarray,
+    threshold: float,
+    box_thresh: float,
+    unclip_ratio: float,
+    min_size: int,
+    max_candidates: int,
+) -> tuple[Union[list[np.ndarray], np.ndarray], list[float]]:
+    """
+    Main post-processing function to convert model predictions into text boxes.
+
+    Args:
+        logit (torch.Tensor): Model output of shape (1, H, W).
+        size (torch.Tensor): Original image size (height, width).
+        threshold (float): Threshold for binarizing the prediction map.
+        box_thresh (float): Score threshold for filtering boxes.
+        unclip_ratio (float): Expansion ratio for unclipping.
+        min_size (int): Minimum side length of valid boxes.
+        max_candidates (int): Maximum number of boxes to extract.
+
+    Returns:
+        tuple:
+            - boxes (list or np.ndarray): Extracted text boxes.
+            - scores (list): Corresponding confidence scores.
+    """
+    src_height, src_width = size
+    mask = logit > threshold
+    boxes, scores = boxes_from_bitmap(logit, mask, src_width, src_height, box_thresh, unclip_ratio, min_size, max_candidates)
+    return boxes, scores


I don't think we need a separate method for this, let's unroll it directly in the post_process method

yonigozlan · 2026-03-04T19:19:35Z

+        outputs,
+        threshold: float = 0.3,
+        target_sizes: Optional[Union[list[tuple[int, int]], torch.Tensor]] = None,
+        box_thresh: float = 0.6,
+        max_candidates: int = 1000,
+        min_size: int = 3,
+        unclip_ratio: float = 1.5,
+    ):
+        """
+        Converts model outputs into detected text boxes.
+
+        Args:
+            preds (torch.Tensor): Model outputs.
+            target_sizes (TensorType or list[tuple]): Original image sizes.
+            threshold (float): Binarization threshold.
+            box_thresh (float): Box score threshold.
+            max_candidates (int): Maximum number of boxes.
+            min_size (int): Minimum box size.
+            unclip_ratio (float): Expansion ratio.
+
+        Returns:
+            list[dict]: List of detection results.
+        """
+
+        results = []
+        for logit, size in zip(outputs.logits, target_sizes):
+            box, score = process(
+                logit=logit[0, :, :].cpu().detach().numpy(),
+                size=size.cpu().detach().numpy(),
+                threshold=threshold,
+                box_thresh=box_thresh,
+                unclip_ratio=unclip_ratio,
+                min_size=min_size,
+                max_candidates=max_candidates,
+            )
+            results.append({"scores": score, "boxes": box})
+        return results
+
+    def get_image_size(
+        self,
+        image: np.ndarray,
+        limit_side_len: int,
+        limit_type: str,
+        max_side_limit: int = 4000,
+    ) -> tuple[dict, np.ndarray]:
+        """
+        Computes the target size for resizing an image while preserving aspect ratio.
+
+        Args:
+            image (torch.Tensor): Input image.
+            limit_side_len (int): Maximum or minimum side length.
+            limit_type (str): Resizing strategy: "max", "min", or "resize_long".
+            max_side_limit (int): Maximum allowed side length.
+
+        Returns:
+            tuple:
+                - SizeDict: Target size.
+                - torch.Tensor: Original size.
+        """
+        limit_side_len = limit_side_len or self.limit_side_len
+        limit_type = limit_type or self.limit_type
+        height, width, _ = image.shape
+
+        if limit_type == "max":
+            if max(height, width) > limit_side_len:
+                if height > width:
+                    ratio = float(limit_side_len) / height
+                else:
+                    ratio = float(limit_side_len) / width
+            else:
+                ratio = 1.0
+        elif limit_type == "min":
+            if min(height, width) < limit_side_len:
+                if height < width:
+                    ratio = float(limit_side_len) / height
+                else:
+                    ratio = float(limit_side_len) / width
+            else:
+                ratio = 1.0
+        elif limit_type == "resize_long":
+            ratio = float(limit_side_len) / max(height, width)
+        else:
+            raise Exception("not support limit type, image ")
+        resize_height = int(height * ratio)
+        resize_width = int(width * ratio)
+
+        if max(resize_height, resize_width) > max_side_limit:
+            ratio = float(max_side_limit) / max(resize_height, resize_width)
+            resize_height, resize_width = int(resize_height * ratio), int(resize_width * ratio)
+
+        resize_height = max(int(round(resize_height / 32) * 32), 32)
+        resize_width = max(int(round(resize_width / 32) * 32), 32)
+
+        if resize_height == height and resize_width == width:
+            return {"height": resize_height, "width": resize_width}, np.array([height, width])
+
+        if int(resize_width) <= 0 or int(resize_height) <= 0:
+            return None, (None, None)
+
+        return {"height": resize_height, "width": resize_width}, np.array([height, width])


If the results obtained with the fast image processors are similar enough, we can completely discard the pil based processor and only keep the fast one

yonigozlan · 2026-03-04T19:23:05Z

+    limit_side_len = 960
+    limit_type = "max"
+    max_side_limit = 4000


Let's add these 3 attributes to the supported kwargs for this model, so that they can be provided at call time and init time.

You can have a look at other image processor fast in the library (like llava-next) to see how that should be done

yonigozlan · 2026-03-04T19:23:41Z

+    def post_process_object_detection(
+        self,
+        outputs,
+        threshold: float = 0.3,
+        target_sizes: Optional[Union[list[tuple[int, int]], torch.Tensor]] = None,
+        box_thresh: float = 0.6,
+        max_candidates: int = 1000,
+        min_size: int = 3,
+        unclip_ratio: float = 1.5,
+    ):
+        """
+        Converts model outputs into detected text boxes.
+
+        Args:
+            preds (torch.Tensor): Model outputs.
+            threshold (float):Binarization threshold.
+            target_sizes (TensorType or list[tuple]): Original image sizes.
+            box_thresh (float): Box score threshold.
+            max_candidates (int): Maximum number of boxes.
+            min_size (int): Minimum box size.
+            unclip_ratio (float): Expansion ratio.
+
+        Returns:
+            list[dict]: List of detection results.
+        """
+
+        results = []
+        for logit, size in zip(outputs.logits, target_sizes):
+            box, score = process(
+                logit=logit[0, :, :].cpu().detach().numpy(),
+                size=size.cpu().detach().numpy(),
+                threshold=threshold,
+                box_thresh=box_thresh,
+                unclip_ratio=unclip_ratio,
+                min_size=min_size,
+                max_candidates=max_candidates,
+            )
+
+            results.append(
+                {
+                    "boxes": box,
+                    "scores": score,
+                }
+            )
+        return results


Let's put this method at the bottom of the module

yonigozlan · 2026-03-04T21:15:38Z

+    and returns outputs compatible with the Transformers object detection API.
+    """
+
+    _keys_to_ignore_on_load_missing = ["num_batches_tracked"]


Is this key present in the original checkpoint? otherwise we can remove this

yonigozlan · 2026-03-04T21:15:56Z

+    def forward(
+        self,
+        pixel_values: torch.FloatTensor,
+        labels: Optional[list[dict]] = None,


let's remove labels if it's not used

yonigozlan · 2026-03-04T21:16:40Z

+        Returns:
+            Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]: Detection output containing
+                segmentation logits, last hidden state, and optional hidden states.
+        """


Fully remove and use auto_docstring

Suggested change

"""

yonigozlan · 2026-03-04T21:18:24Z

+class PPOCRV5MobileDetForObjectDetection(PPOCRV5MobileDetPreTrainedModel):
+    """
+    PPOCRV5 Mobile Det model for object (text) detection tasks. Wraps the core PPOCRV5MobileDetModel
+    and returns outputs compatible with the Transformers object detection API.
+    """
+
+    _keys_to_ignore_on_load_missing = ["num_batches_tracked"]
+
+    def __init__(self, config: PPOCRV5MobileDetConfig):
+        """
+        Initialize the PPOCRV5MobileDetForObjectDetection with the specified configuration.
+
+        Args:
+            config (PPOCRV5MobileDetConfig): Configuration object containing all model hyperparameters.
+        """
+        super().__init__(config)
+        self.model = PPOCRV5MobileDetModel(config)
+        self.post_init()
+
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor,
+        labels: Optional[list[dict]] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs,
+    ) -> Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]:
+        """
+        Forward pass of the PPOCRV5MobileDetForObjectDetection model, processing input images to generate
+        text detection logits.
+
+        Args:
+            pixel_values (torch.FloatTensor): Input image tensor of shape (B, 3, H, W) (preprocessed pixel values).
+            labels (list[dict], optional): Unused placeholder for training (object detection labels). Defaults to None.
+            output_hidden_states (bool, optional): Whether to return all intermediate hidden states from the backbone.
+                If None, uses the configuration's `output_hidden_states` value.
+            return_dict (bool, optional): Whether to return a `PPOCRV5MobileDetForObjectDetectionOutput` object or a tuple.
+                If None, uses the configuration's `use_return_dict` value.
+            **kwargs: Additional unused keyword arguments for compatibility.
+
+        Returns:
+            Union[tuple[torch.FloatTensor], PPOCRV5MobileDetForObjectDetectionOutput]: Detection output containing
+                segmentation logits, last hidden state, and optional hidden states.
+        """
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.model(pixel_values, output_hidden_states=output_hidden_states, return_dict=return_dict)
+
+        if not return_dict:
+            output = (outputs[0],)
+            if output_hidden_states:
+                output += (outputs[1], outputs[2])
+            else:
+                output += (outputs[1],)
+
+            return output
+
+        return PPOCRV5MobileDetForObjectDetectionOutput(
+            logits=outputs.logits,
+            last_hidden_state=outputs.last_hidden_state,
+            hidden_states=outputs.hidden_states if output_hidden_states else None,
+        )


It looks like PPOCRV5MobileDetForObjectDetection PPOCRV5MobileDetModel actually return the same thing. Would it make sense to remove the head from PPOCRV5MobileDetModel and have it only in PPOCRV5MobileDetForObjectDetection ? Could PPOCRV5MobileDetModel be of any use then? (For custom heads for example)

vasqu

Great work! There are a few smaller things but nothing major imo. Let's wrap it up tomorrow then 🤗

vasqu · 2026-03-12T19:43:53Z

+image_processor = AutoImageProcessor.from_pretrained(model_path).to(model.device)
+
+image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/img_rot180_demo.jpg", stream=True).raw)
+inputs = image_processor(images=[image, image], return_tensors="pt")


Suggested change

inputs = image_processor(images=[image, image], return_tensors="pt")

inputs = image_processor(images=[image, image], return_tensors="pt").to(model.device)

Oh, seeing that the image processor is moved to the device here. Let's sync the examples at least to either do that or the other version

vasqu · 2026-03-12T19:45:55Z

+from transformers import AutoImageProcessor, AutoModelForObjectDetection
+
+model_path="PaddlePaddle/PP-OCRv5_mobile_det_safetensors"
+model = AutoModelForObjectDetection.from_pretrained(model_path)


Let's add devic_map here as well (with either version with device movement for the processor input)

vasqu · 2026-03-12T19:46:01Z

+from transformers import AutoImageProcessor, AutoModelForObjectDetection
+
+model_path="PaddlePaddle/PP-OCRv5_mobile_det_safetensors"
+model = AutoModelForObjectDetection.from_pretrained(model_path)


vasqu · 2026-03-12T19:49:59Z

+    @filter_output_hidden_states
+    @can_return_tuple


Suggested change

@filter_output_hidden_states

@can_return_tuple

@can_return_tuple

@filter_output_hidden_states

super nitpicky, but let's keep the same order as elsewhere

vasqu · 2026-03-12T19:50:14Z

+        >>> feature_maps = outputs.feature_maps
+        >>> list(feature_maps[-1].shape)
+        ```"""
+        kwargs["output_hidden_states"] = True


Suggested change

kwargs["output_hidden_states"] = True

kwargs["output_hidden_states"] = True # required to extract layers for the stages

vasqu · 2026-03-12T20:01:58Z

+        return fused_feature_map
+
+
+class PPOCRV5MobileDetHead(nn.Module):


Reopenin this one - can we inherit somehow form the server head or similar?

vasqu · 2026-03-12T20:05:36Z

+        interpolation: Optional["tvF.InterpolationMode"],
+        **kwargs,
+    ) -> BatchFeature:
+        requires_backends(self, ["torch"])


That's on me, I think we can directly move on top of the image processor - there is some decorator @requires(backends=("torch")) should be it. I think we should do the same for the server image processor iirc - missed that in the other PR

Better late than never, I’ve just added it for server_det as well.

vasqu · 2026-03-12T20:06:19Z

@@ -0,0 +1,121 @@
+# coding = utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.


Suggested change

# Copyright 2025 The HuggingFace Inc. team. All rights reserved.

# Copyright 2026 The HuggingFace Inc. team. All rights reserved.

slipped through, too many files :D

sry, I will check all files

vasqu · 2026-03-12T20:07:00Z

+    @is_flaky()
+    def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):
+        super().test_batching_equivalence(atol=atol, rtol=rtol)
+


Oh is this needed? Would like to avoid but no problem if not

…_det

zhang-prog · 2026-03-13T11:46:52Z

@vasqu I’ve finished the changes, PTAL. You can aslo run slow-test. 🤗

vasqu · 2026-03-13T12:03:05Z

Just pushed this, still something to fix on our side that this is moved but this was missing

vasqu

My last comments now! Quick fixed something else on pp lcnet re modular but should be good now

vasqu · 2026-03-13T12:09:44Z

+        return fused_feature_map
+
+
+class PPOCRV5MobileDetHead(nn.Module):


Can we use PPOCRV5ServerDetSegmentationHead to inherit from modular here? Structure seems very similar

vasqu · 2026-03-13T12:10:09Z

+        if getattr(self.config, "output_hidden_states", False):  # get output_hidden_states from config
+            kwargs["output_hidden_states"] = True


Suggested change

if getattr(self.config, "output_hidden_states", False): # get output_hidden_states from config

kwargs["output_hidden_states"] = True

not needed, it is forced within the backbone either way

Without this, test_hidden_states_output would fail. Cause I found that simply setting config.output_hidden_states = Truein def test_hidden_states_output doesn’t propagate to the backbone’s config. Do you have any suggestions for a better way to handle this?

Check out def _set_subconfig_attributes in test_modeling_common - it recursively sets it to all subconfigs

it works, thanks.

vasqu · 2026-03-13T12:11:58Z



 @auto_docstring
+@requires(backends=("torch",))


Thanks 🤗

vasqu · 2026-03-13T12:12:24Z

+    # @is_flaky()
+    # def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):
+    #     super().test_batching_equivalence(atol=atol, rtol=rtol)


Suggested change

# @is_flaky()

# def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):

# super().test_batching_equivalence(atol=atol, rtol=rtol)

vasqu · 2026-03-13T12:14:56Z

run-slow: pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det

vasqu · 2026-03-13T12:15:30Z

@zhang-prog running the slow tests meanwhile, you can push changes then after the results

github-actions · 2026-03-13T12:16:16Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pp_lcnet", "models/pp_lcnet_v3", "models/pp_ocrv5_mobile_det"]
quantizations: []

github-actions · 2026-03-13T12:32:58Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	3d315365	workflow commit (merge commit)
PR	05f75611	branch commit (from PR)
main	3dd82faf	base commit (on `main`)

Model CI Report

❌ 1 new failed tests from this PR 😭

pp_lcnet:
tests/models/pp_lcnet/test_modeling_pp_lcnet.py::PPLCNetModelIntegrationTest::test_inference_image_classification_head (✅ ⟹ ❌)

github-actions · 2026-03-13T14:30:31Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det, pp_ocrv5_server_det

zhang-prog · 2026-03-13T14:30:46Z

@vasqu PTAL. please run-slow: pp_lcnet agagin

vasqu · 2026-03-13T14:31:15Z

run-slow: pp_lcnet, pp_lcnet_v3, pp_ocrv5_mobile_det

github-actions · 2026-03-13T14:32:33Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pp_lcnet", "models/pp_lcnet_v3", "models/pp_ocrv5_mobile_det"]
quantizations: []

github-actions · 2026-03-13T14:43:50Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	50a01fe0	workflow commit (merge commit)
PR	7ac220e9	branch commit (from PR)
main	2548d0db	base commit (on `main`)

✅ No failing test specific to this PR 🎉 👏 !

vasqu · 2026-03-13T14:48:32Z

Some weird errors on CI, checking again but would merge if it turns green - don't think it's related to you

HuggingFaceDocBuilderDev · 2026-03-13T15:03:00Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zhang-prog · 2026-03-13T15:17:07Z

@vasqu Thank you so much for your thorough reviews. Your feedback has been incredibly professional and efficient, and it’s been a pleasure collaborating with you.

Looking ahead, our team hope merging five models next week (before March 20th):
PP-OCRv5_server_rec, PP-OCRv5_mobile_rec, PP-Chart2Table, UVDoc, and SLANeXt.

These PRs will be created by other members of my team, and I will be involved to ensure the code meets your standards, aiming to make the review process as smooth as possible for you.

I was wondering if you would be available to help us review them as well.

Looking forward to your reply.

Thanks again! 🤗

vasqu · 2026-03-13T15:25:00Z

@zhang-prog @XingweiDeng Thanks a lot for all these iterations and pulling through, has been a pleasure as well from my side! 🫡 Also congrats, that's a lot of models in a short timespan already :)

I will definitely check around, I also raised some discussion internally so that other members in the team can help out (cc @yonigozlan @zucchini-nlp @ArthurZucker @Cyrilvallez). In any case, you can always ping me on slack on our shared slack channel.

Let us know if there is anything else we can do 🤗

zhang-prog · 2026-03-13T15:31:43Z

@vasqu This is great. We’ll be in touch next week once our PRs are ready for your review.
Hope you have a great weekend!

vasqu · 2026-03-13T15:35:51Z

Same, have a great weekend and rest up for next week 🤗

ydshieh · 2026-03-14T07:19:32Z

Hi @XingweiDeng and @zhang-prog

Thank you for this work 🚀 !

I saw there is

checkpoint="PaddlePaddle/Not_yet_released",

in PPLCNetV3Config

and the test file

transformers/models/pp_lcnet_v3/configuration_pp_lcnet_v3.py

don't have any integration test.

Look forward for them being added once the checkpoint is released 🙏 .

XingweiDeng added 6 commits January 13, 2026 11:25

Feat: Add PP-OCRV5_mobile_det model

25c2125

fix code

4b9b61b

fix code

5bc2023

fix

cc5cda7

use cv and np to replace pyclipper

34cfc5f

add model post_init()

4570111

NielsRogge mentioned this pull request Jan 13, 2026

is there any open model that can extract data from document with bounding boxes ? NielsRogge/Transformers-Tutorials#475

Open

yonigozlan added the New model label Jan 13, 2026

fix

f00d4c4

yonigozlan reviewed Feb 6, 2026

View reviewed changes

XingweiDeng added 4 commits February 9, 2026 15:39

fix model init_weight

074174c

fix rename module

8a38c71

update

b7f404b

update

bcf332f

update

2727519

update

b23550f

XingweiDeng added 7 commits February 27, 2026 11:03

update

8790263

update

b5c9a63

update

cdc64ed

update

be7e631

update

e469bf1

update

4e6e3ac

update

376a3e0

yonigozlan reviewed Mar 4, 2026

View reviewed changes

XingweiDeng added 2 commits March 6, 2026 09:23

merge main

bc1ed85

update

431a1a5

vasqu approved these changes Mar 12, 2026

View reviewed changes

zhang-prog added 2 commits March 13, 2026 19:38

fix

62f1a7b

Merge remote-tracking branch 'origin/main' into feat/pp_ocr_v5_mobile…

dd703a9

…_det

fixup modular

05f7561

vasqu reviewed Mar 13, 2026

View reviewed changes

vasqu approved these changes Mar 13, 2026

View reviewed changes

fix

7ac220e

Merge branch 'main' into feat/pp_ocr_v5_mobile_det

6c60d4e

fix docs

57200c2

vasqu enabled auto-merge March 13, 2026 15:05

vasqu added this pull request to the merge queue Mar 13, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 13, 2026

vasqu merged commit 036340b into huggingface:main Mar 13, 2026
28 checks passed

	inputs = image_processor(images=[image, image], return_tensors="pt")
	inputs = image_processor(images=[image, image], return_tensors="pt").to(model.device)

	kwargs["output_hidden_states"] = True
	kwargs["output_hidden_states"] = True # required to extract layers for the stages

		return fused_feature_map


		class PPOCRV5MobileDetHead(nn.Module):

		@@ -0,0 +1,121 @@
		# coding = utf-8
		# Copyright 2025 The HuggingFace Inc. team. All rights reserved.

	# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
	# Copyright 2026 The HuggingFace Inc. team. All rights reserved.

		if getattr(self.config, "output_hidden_states", False): # get output_hidden_states from config
		kwargs["output_hidden_states"] = True

	# @is_flaky()
	# def test_batching_equivalence(self, atol=5e-2, rtol=5e-2):
	# super().test_batching_equivalence(atol=atol, rtol=rtol)



		@auto_docstring
		@requires(backends=("torch",))

Conversation

XingweiDeng commented Jan 13, 2026

What does this PR do?

Before submitting

Who can review?

Uh oh!

yonigozlan left a comment

Choose a reason for hiding this comment

Uh oh!

XingweiDeng commented Feb 14, 2026

Uh oh!

XingweiDeng commented Feb 25, 2026

Uh oh!

yonigozlan commented Feb 26, 2026

Uh oh!

XingweiDeng commented Mar 3, 2026

Uh oh!

yonigozlan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vasqu left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhang-prog commented Mar 13, 2026