[Model] Add PP-OCRV5_mobile_rec Model Support #43793

Closed
liu-jiaxuan wants to merge 5 commits into huggingface:main from liu-jiaxuan:feat/pp_ocrv5_mobile_rec

Conversation

@liu-jiaxuan (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yonigozlan (Member) left a comment

Hello @liu-jiaxuan! Thanks for opening this PR; however, there is quite a bit to change here to fit the standards of the Transformers library.

The biggest issue is that you've written everything from scratch without inheriting from existing models. The modular file should maximize inheritance. Even if this is a novel architecture (especially the Conv modules part, which might not exist elsewhere in the library), components like MLP blocks, attention, and layer norms should use standard library patterns by inheriting from an existing model's module in modular.
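
For illustration only, here is a minimal sketch of what maximizing inheritance in the modular file could look like. The parent classes below (LlamaMLP, LlamaAttention) are placeholders, not a recommendation; the actual parents should be whichever existing models are closest to this architecture.

```python
# modular_pp_ocrv5_mobile_rec.py -- hypothetical sketch, not the actual PR code.
# The parent classes are illustrative placeholders; pick the closest existing model.
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaMLP


class PPOCRV5MobileRecMLP(LlamaMLP):
    # Inherit the standard MLP block unchanged; the modular converter expands
    # this into modeling_pp_ocrv5_mobile_rec.py at build time.
    pass


class PPOCRV5MobileRecAttention(LlamaAttention):
    # Only override what genuinely differs from the parent implementation.
    pass
```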

The novel modules that can't be inherited through modular should also follow library standards in terms of naming, formatting, structure, and good practices (a "PPOCRV5MobileRec" prefix for all module names, weight names standardized with other similar modules in the library, no single-letter variables, type hints, docstrings when args are not standard or obvious, never use "eval()", etc.), and the model should support as many Transformers features as possible, such as the attention interface through flags on PreTrainedModel (_supports_attention_backend, _supports_sdpa, _supports_flash_attn, etc.).
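
As a hedged example of the flags being referred to, the sketch below declares them on the PreTrainedModel subclass; the class name and flag values are assumptions and depend on what the model actually supports once the attention modules use the standard interface.

```python
# Hypothetical sketch: declaring attention-backend support on the
# PreTrainedModel subclass (config_class, docstrings, etc. omitted here).
from transformers import PreTrainedModel


class PPOCRV5MobileRecPreTrainedModel(PreTrainedModel):
    base_model_prefix = "model"
    _supports_sdpa = True
    _supports_flash_attn = True
    _supports_attention_backend = True
```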

Some other big things wrong or missing:

  • We shouldn't have a cv2 dependency in image processors: the "slow" processor should use PIL/numpy functions, and the fast one torch/torchvision.
  • Weight initialization shouldn't be scattered in individual module constructors but centralized in _init_weights() on the PreTrainedModel class, using the transformers "init" module (see the sketch after this list).
  • Attention modules are standardized across models in the transformers library, so using modular for attention modules is a must.
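
A minimal sketch of the centralized initialization pattern described above; the module types handled and the initializer_range attribute are assumptions, and the actual distributions must match how the original weights were initialized.

```python
# Hypothetical sketch of centralized weight initialization in _init_weights().
import torch.nn as nn
from transformers import PreTrainedModel


class PPOCRV5MobileRecPreTrainedModel(PreTrainedModel):
    def _init_weights(self, module: nn.Module) -> None:
        std = getattr(self.config, "initializer_range", 0.02)  # assumed config field
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.fill_(1.0)
            module.bias.data.zero_()
```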

Before we go deeper in reviewing this new model addition (and the other PaddlePaddle ones opened recently that are very similar), please have a good look at how other models are implemented in the library. Notably, you can have a look at the recently merged PP-DocLayoutV3 PR (here's its modular file).
We also have resources to learn more about how to contribute a new model and how to use modular: Contributing a new model, using modular.

Also, as the multiple PaddlePaddle models that currently have new model addition PRs open seem to be quite similar, I'd recommend focusing on one (the simplest) for now; then we'll be able to leverage modular to easily add the other models.

Happy to answer any questions you may have!

(Outdated comment thread on src/transformers/models/pp_ocrv5_mobile_rec/modular_pp_ocrv5_mobile_rec.py)
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, pp_ocrv5_mobile_rec

@liu-jiaxuan (Contributor, Author)

Hello @yonigozlan, thank you very much for your detailed review and valuable guidance!
We have revised the three models (pp_ocrv5_mobile_rec, pp_ocrv5_server_rec, and slanext) to address the issues you mentioned.

Specifically, we have implemented the following improvements:

  1. Removed cv2 dependency: We replaced cv2 in image preprocessing and switched to numpy-based image processing. However, since all three models are designed for text or table recognition tasks and are highly sensitive to pixel-level perturbations in images, replacing the cv2 operations has had some impact on model accuracy.
  2. Centralized weight initialization: We moved all weight initialization to the _init_weights() method of the PreTrainedModel class, and removed the initialization code from the constructors of individual modules.
  3. Refactored inheritance and naming: Modified classes that could be modularly inherited to extend existing models, removed standalone implementations for modules that PyTorch supports directly, and added model prefixes (e.g., PPOCRV5MobileRec) to classes that cannot be directly inherited in a modular fashion.
  4. Removed unused functionality from various model modules, such as the DropPath class you pointed out.

We will continue to refine the code according to the transformers library standards and conventions. Please let us know if you have any further comments or suggestions.
Thank you again for your help!

@yonigozlan (Member)

Thanks a lot for iterating @liu-jiaxuan! I'll have a look in the coming days.

> We replaced cv2 in image preprocessing and switched to numpy-based image processing. However, since all three models are designed for text or table recognition tasks and are highly sensitive to pixel-level perturbations in images, replacing the cv2 operations has had some impact on model accuracy.

We do support using PIL resize in the slow processor and torchvision resize in the fast one; maybe you'll get closer results with those, when choosing the interpolation equivalent to the one used in cv2, than with custom numpy code?
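
For instance, here is a hypothetical sketch of matching the cv2 interpolation; it assumes the original preprocessing used cv2.INTER_LINEAR and a 48×320 target size, both of which are illustrative assumptions rather than details from this PR.

```python
# Hypothetical illustration: pick the interpolation closest to the original
# cv2 resize instead of re-implementing it in numpy. cv2.INTER_LINEAR maps
# roughly to PIL BILINEAR and torchvision InterpolationMode.BILINEAR
# (results are close but not guaranteed to be bit-identical).
from PIL import Image
from torchvision.transforms import InterpolationMode
from torchvision.transforms.v2 import functional as F

image = Image.open("sample.png").convert("RGB")  # placeholder input

# "Slow" (PIL) path: PIL takes (width, height).
resized_slow = image.resize((320, 48), resample=Image.Resampling.BILINEAR)

# "Fast" (torchvision) path: resize takes [height, width]; antialias=False
# tends to track cv2's bilinear behavior more closely than antialias=True.
tensor = F.pil_to_tensor(image)
resized_fast = F.resize(
    tensor, size=[48, 320], interpolation=InterpolationMode.BILINEAR, antialias=False
)
```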

@liu-jiaxuan (Contributor, Author)

> Thanks a lot for iterating @liu-jiaxuan! I'll have a look in the coming days.
>
> > We replaced cv2 in image preprocessing and switched to numpy-based image processing. However, since all three models are designed for text or table recognition tasks and are highly sensitive to pixel-level perturbations in images, replacing the cv2 operations has had some impact on model accuracy.
>
> We do support using PIL resize in the slow processor and torchvision resize in the fast one; maybe you'll get closer results with those, when choosing the interpolation equivalent to the one used in cv2, than with custom numpy code?

Hi @yonigozlan, thank you very much for your suggestion! Based on our current experiments, using PIL/torchvision for image preprocessing in these three models results in a larger accuracy loss compared to numpy. Therefore, we have chosen the numpy-based approach. We will continue to iterate on the PIL/torchvision-based preprocessing method, and we will update the PR immediately if we achieve a better version.

@vasqu (Contributor)

vasqu commented Mar 19, 2026

Closing in favor of #44808

@vasqu closed this Mar 19, 2026