add Qianfan-OCR model definition (#45280)
Conversation
zucchini-nlp
left a comment
Hey! The PR is probably not ready yet, so I'm just leaving a very early review. I quickly skimmed the model and left comments about where we can copy from for some modules. It looks like all files can be put entirely in modular; there is a lot of copying going on in the config and processor as well.
```python
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    # Top-level prefix: vision_model.* → model.vision_tower.*
    r"^vision_model\.": r"model.vision_tower.",
    # Encoder layer list: encoder.layers.N → encoder.layer.N
    r"encoder\.layers\.": r"encoder.layer.",
    # NOTE: class_embedding, patch_embedding, position_embedding keep their
```
Probably we could do this with conversion_mapping and apply the rest of the changes to the config/tokenizer directly on the hub repo?
Thanks for the advice. Yes, I found conversion_mapping to be a very helpful tool for converting naming between safetensors under different naming schemas. Also, I felt there might be some outdated information in https://huggingface.co/docs/transformers/main/en/contributing that misled me, and probably misleads others contributing a new VLM model for the first time as well.
Would you mind if I raise another PR to update the documentation too, specifically the VLM contribution checklist part, which is quite different from the current contribution process?
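As a rough illustration of how a regex-based conversion mapping like the one above operates, here is a minimal sketch. The `convert_key` helper is hypothetical (it is not the actual function transformers uses internally); it just shows the effect of applying the two patterns in order to each checkpoint key:

```python
import re

# Mapping in the spirit of the snippet above: each regex pattern is applied
# in turn to every checkpoint key; non-matching patterns are no-ops.
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    r"^vision_model\.": "model.vision_tower.",
    r"encoder\.layers\.": "encoder.layer.",
}

def convert_key(key: str) -> str:
    # Hypothetical helper: rewrite one checkpoint key through the mapping.
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION.items():
        key = re.sub(pattern, replacement, key)
    return key

print(convert_key("vision_model.encoder.layers.0.attn.qkv.weight"))
# model.vision_tower.encoder.layer.0.attn.qkv.weight
```

Keys outside the vision tower (e.g. the language model) pass through unchanged, since neither pattern matches them.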
```python
self.lambda_1 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
self.lambda_2 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
```
The biggest diff from existing models seems to be here? Do we need to apply it in the forward pass as a separate param, or could it be fused with the previous proj layers?
For the vision layer, the part that differs from InternVLVisionLayer is the drop_path layers. I have updated the definition in the modular file so this class inherits from its InternVL counterpart, and removed other redundant definitions to reuse the existing model. As for the two layer-scale terms you commented on, I think they are identical to what we already have in the existing InternVL model definition.
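For context, the `lambda_1`/`lambda_2` parameters discussed above follow the common "layer scale" pattern: a learnable per-channel gain multiplied onto a sublayer's output. A minimal, self-contained sketch (the `LayerScale` class name and `init_values` default are illustrative, not the model's actual module):

```python
import torch
from torch import nn

class LayerScale(nn.Module):
    """Per-channel learnable scaling, as in the lambda_1/lambda_2 params above."""

    def __init__(self, hidden_size: int, init_values: float = 0.1):
        super().__init__()
        # One learnable scale per hidden channel, initialized to init_values.
        self.lambda_1 = nn.Parameter(init_values * torch.ones(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Broadcast multiply over the last (channel) dimension.
        return self.lambda_1 * hidden_states
```

Fusing the scale into the preceding projection's weights would be mathematically equivalent at inference time, but keeping it as a separate parameter matches the released checkpoint layout.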
Force-pushed from b3c0164 to 0291780 (compare)
Hi @zucchini-nlp, thanks for taking the time to review this PR, and sorry for the previous broken PR that was sent out before reviewing it locally by running the CI checks. I have updated the PR according to your comments, and specifically:
Please let me know if there is anything I can do to make it better, thanks!
zucchini-nlp
left a comment
Great work, much much cleaner! I want to push a bit more on using modular, because there are a few modules that look identical to me. Left comments below.
A core maintainer will pass by next week for the final review :)
```python
base_h = self.image_size[0] // patch_size[0]
base_w = self.image_size[1] // patch_size[1]
new_h = height // patch_size[0]
new_w = width // patch_size[1]
```
Tbh this looks very much the same as InternVL, or does Qianfan have a non-square image_size? In any case, can you add the major diff as a tiny comment?
The initial idea was to keep this for future compatibility; however, there is only a square image size in our currently released model. Let me update the implementation, and add this back when we release non-square patches in the future.
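The arithmetic under discussion can be sketched as a small standalone helper. The `patch_grid` function below is hypothetical (names are mine, not the model's), but it mirrors the computation in the snippet: a base grid from the configured (possibly non-square) `image_size`, and a new grid from the actual input resolution, both in whole-patch units:

```python
def patch_grid(image_size, patch_size, height, width):
    """Hypothetical helper mirroring the snippet above: compute the base
    patch grid from the configured image_size and the target grid from
    the actual input height/width."""
    base_h = image_size[0] // patch_size[0]
    base_w = image_size[1] // patch_size[1]
    new_h = height // patch_size[0]
    new_w = width // patch_size[1]
    return (base_h, base_w), (new_h, new_w)

# A 448x448 config with 14x14 patches gives a 32x32 base grid;
# a 448x896 input maps to a 32x64 target grid.
print(patch_grid((448, 448), (14, 14), 448, 896))
# ((32, 32), (32, 64))
```

With a square `image_size`, `base_h == base_w`, which is why the InternVL formulation covers the currently released checkpoints.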
```python
try:
    target_dtype = next(self.vision_tower.parameters()).dtype
except StopIteration:
    target_dtype = pixel_values.dtype
```
Shouldn't be needed, because the VisionModel casts internally:
`self.projection(pixel_values.to(self.projection.weight.dtype))`
And if the rest is the same, we can delete this and let modular copy.
Hi, this is actually for DataParallel compatibility. As stated in the previous comment, the current implementation in InternVL has a bug in a multi-GPU environment (reproducible on 4090 x2): self.dtype would iterate over an empty parameter list and thus raise StopIteration. I did some research and found that DataParallel is now deprecated, so let's reuse InternVL and mark this UT as skipped.
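The failure mode described above is easy to reproduce in miniature: `module.parameters()` can yield an empty iterator (e.g. on a replicated submodule), in which case `next()` raises `StopIteration`. A hedged sketch of the defensive pattern in the snippet, extracted into a hypothetical standalone helper:

```python
import torch
from torch import nn

def safe_param_dtype(module: nn.Module, fallback: torch.dtype) -> torch.dtype:
    """Hypothetical helper: read the dtype of a module's first parameter,
    falling back when the module has no parameters (the empty-iterator case
    that raises StopIteration under the deprecated DataParallel)."""
    try:
        return next(module.parameters()).dtype
    except StopIteration:
        return fallback
```

A bare `nn.Module()` has no parameters, so the fallback path is taken; a regular `nn.Linear` reports its weight dtype.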
Force-pushed from 0291780 to cf8b9b9 (compare)
Looks like the failed CI case is due to
Rebasing will help, we fixed it yesterday :) Also requesting a review from @vasqu, since I suppose the PR is mostly modularized by now; I might pass by later.
Force-pushed from cf8b9b9 to 1015cfe (compare)
Looks like CI is blocked by an issue in
Taking a look in a bit; don't worry about the CI, it looks like a flaky test / something we need to fix on our side.
vasqu
left a comment
Already looks super good imo, just a lot of details we could further incorporate.
One bigger point might be to use the VLM tester, wdyt @zucchini-nlp?
Force-pushed from dbb5438 to ee8deed (compare)
Thanks for raising constructive comments. I have updated some of the code; please let me know if there is anything that should be fixed further.
vasqu
left a comment
Some last details. One big thing to change imo: refactor the tests with our VLMTester instead of the current manual version. Other than that, nothing too big imo.
```python
    return hidden_states


class QianfanOCRVisionEncoder(nn.Module):
```
Reopening: we should not have this Module at all, this should be directly within QianfanOCRVisionModel. You will need to:
- update the conversion mapping to include a rename `WeightRenaming(r"encoder.layers", r"layers")`
- move these layers to the parent module
Oh sorry, I thought the previous comment was not addressed to me, so I didn't pay attention to it. Let's refactor to eliminate this useless class.
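On the checkpoint side, the suggested `WeightRenaming(r"encoder.layers", r"layers")` amounts to rewriting state-dict keys so the layer list hangs directly off the vision model. A minimal sketch with a hypothetical helper (not transformers' actual `WeightRenaming` implementation):

```python
import re

def flatten_encoder_keys(state_dict: dict) -> dict:
    """Hypothetical sketch of WeightRenaming(r"encoder.layers", r"layers"):
    drop the intermediate encoder module from every checkpoint key."""
    return {re.sub(r"encoder\.layers", "layers", k): v for k, v in state_dict.items()}

renamed = flatten_encoder_keys({"encoder.layers.0.attention.weight": "w"})
print(list(renamed))
# ['layers.0.attention.weight']
```

Once the layers live on the parent module, the module path in the code and the renamed checkpoint keys line up, so no extra remapping is needed at load time.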
```python
from PIL import Image


class QianfanOCRVisionText2TextModelTester:
```
Let's refactor the tests here @marvinzh, e.g. transformers/tests/models/qwen3_vl/test_modeling_qwen3_vl.py (lines 46 to 134 in b6f9463). This should avoid a lot of manual work.
Force-pushed from ee8deed to 1d218c7 (compare)
Refactored the module to squeeze the useless class out, and refactored the tests to use VLMTester. Looks like the torch.compile issue is fine on the CI end.
Will take a look in a bit!
vasqu
left a comment
Fixed some small last details myself 🤗 will check with our CI (run-slow) in a second.
Seems like parts of the CI are unstable, so I would likely merge tomorrow (if run-slow passes).
```python
from ...utils.output_capturing import capture_outputs
from ..auto import CONFIG_MAPPING, AutoConfig
from ..beit.modeling_beit import BeitDropPath
from ..internvl.configuration_internvl import InternVLConfig, InternVLVisionConfig
```
There was a modular bug, fixed in 92fa1c3. I don't think we need to change anything, but it would still be nice if you could cross-check.
run-slow: qianfan_ocr

This comment contains models: ["models/qianfan_ocr"]
CI Results / Commit Info

Model CI Report: ❌ 3 new failed tests from this PR 😭
Ok, looking at https://github.com/huggingface/transformers/actions/runs/24520690200 (the workflow run from run-slow), it seems that the integration tests fail. It could very likely be a GPU difference (we use A10 GPUs), so I can adjust the values to that if the model still works as expected (and I didn't destroy anything). Just let me know @marvinzh. Side note: CI is unstable, so don't worry about those red CIs 😢
Hi @vasqu, thanks for the comments and the approval! Currently we use the outputs for calibration on a 4090 (cu127). As we do not have access to A10 GPUs, please help adjust the outputs under your environment, thanks!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, qianfan_ocr
run-slow: qianfan_ocr

This comment contains models: ["models/qianfan_ocr"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45280&sha=2fbd0d
Thanks for all the iterations, the model has now been merged 🤗 @marvinzh
What does this PR do?
add Qianfan-OCR model definition
- `QianfanOCRForConditionalGeneration` - image-text-to-text model definition
- `QianfanOCRModel` - backbone of the image-text-to-text model without LM heads
- `QianfanOCRProcessor` - text and image preprocessor
- `QianfanOCRVisionModel` - vision transformer used in the Qianfan-OCR model

Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.

Before submitting
- Pull Request section?
- documentation guidelines, and here are tips on formatting docstrings.
Multimodal LLM checklist
- `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- `<Model>ImageProcessor` (from `TorchvisionBackend`) and PIL backend (`<Model>ImageProcessorPil` from `PilBackend`) both implemented (see IMAGE_PROCESSOR_REFACTORING_GUIDE.md)
- `convert_<model_name>_to_hf.py` added with usage examples
- `docs/source/en/model_doc/`
- `make style` passes with no errors

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@zucchini-nlp @vasqu