
add Qianfan-OCR model definition#45280

Merged
vasqu merged 40 commits into huggingface:main from marvinzh:add-qianfan-ocr
Apr 17, 2026

Conversation

@marvinzh
Contributor

@marvinzh marvinzh commented Apr 7, 2026

What does this PR do?

add Qianfan-OCR model definition

  • QianfanOCRForConditionalGeneration - the image-text-to-text model definition
  • QianfanOCRModel - the backbone of the image-text-to-text model, without the LM head
  • QianfanOCRProcessor - the text and image preprocessor
  • QianfanOCRVisionModel - the vision transformer used in the Qianfan-OCR model
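For intuition, a processor like QianfanOCRProcessor typically expands a single image placeholder in the prompt into as many image tokens as the vision tower emits features, so text and vision embeddings can be merged position-wise. A minimal sketch of that expansion (the placeholder string and token count are illustrative, not Qianfan-OCR's actual values):

```python
def expand_image_tokens(prompt: str, num_image_tokens: int, placeholder: str = "<image>") -> str:
    """Replace each image placeholder with num_image_tokens copies so that
    token positions line up with the projected vision features."""
    return prompt.replace(placeholder, placeholder * num_image_tokens)

# Illustrative only: a real vision tower emits hundreds of features per image.
expanded = expand_image_tokens("OCR this page: <image>", num_image_tokens=4)
```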

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

Multimodal LLM checklist

  • Modular file: modular_<model_name>.py implemented and verified with python utils/modular_model_converter.py <model_name>
  • Image processors: Torchvision backend (<Model>ImageProcessor from TorchvisionBackend) and PIL backend (<Model>ImageProcessorPil from PilBackend) both implemented (see IMAGE_PROCESSOR_REFACTORING_GUIDE.md)
  • Conversion script: convert_<model_name>_to_hf.py added with usage examples
  • Integration tests: End-to-end tests with exact output matching (text or logits)
  • Documentation: Model docs added/updated in docs/source/en/model_doc/
  • Pattern reuse: Verified against similar models (LLaVA, Idefics2, etc.)
  • Quality checks: make style passes with no errors

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp @vasqu

Member

@zucchini-nlp zucchini-nlp left a comment

Hey! The PR is probably not yet ready, so I'm just leaving a very early review. I quickly skimmed over the model and left comments about where we can copy from for some modules. It looks like all the files can be put entirely in modular; there is a lot of copying going on in the config and processor as well

Comment on lines +58 to +63
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    # Top-level prefix: vision_model.* → model.vision_tower.*
    r"^vision_model\.": r"model.vision_tower.",
    # Encoder layer list: encoder.layers.N → encoder.layer.N
    r"encoder\.layers\.": r"encoder.layer.",
    # NOTE: class_embedding, patch_embedding, position_embedding keep their
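For reference, a mapping like this is typically applied by running `re.sub` with each pattern over every checkpoint key; a minimal, self-contained sketch (the key names are illustrative):

```python
import re

# Same shape as the PR's mapping: regex pattern -> replacement string.
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    r"^vision_model\.": r"model.vision_tower.",
    r"encoder\.layers\.": r"encoder.layer.",
}

def convert_key(key: str) -> str:
    # Apply each pattern in order; later patterns see earlier renames.
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION.items():
        key = re.sub(pattern, replacement, key)
    return key

converted = convert_key("vision_model.encoder.layers.0.attn.qkv.weight")
```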
Member

Could we perhaps do this with conversion_mapping and apply the rest of the changes to the config/tokenizer directly on the Hub repo?

Contributor Author

@marvinzh marvinzh Apr 9, 2026

Thanks for the advice. Yes, I found conversion_mapping a very helpful tool for converting names between safetensors that use different naming schemas. Also, I feel there may be some outdated information in https://huggingface.co/docs/transformers/main/en/contributing that misled me, and probably misleads others contributing a new VLM model for the first time as well.

Would you mind if I raise another PR to update the documentation too, specifically the VLM contribution checklist, which is quite different from the contribution process now?

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +79 to +80
self.lambda_1 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
self.lambda_2 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
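For context, lambda_1 and lambda_2 are InternVL/BEiT-style layer-scale parameters: learnable per-channel multipliers applied to the attention and MLP residual branches. A minimal sketch of the pattern (the module name and init value below are illustrative, not taken from the PR):

```python
import torch
from torch import nn

class LayerScale(nn.Module):
    """Learnable per-channel scale for a residual branch; a small init value
    keeps the branch close to identity early in training."""
    def __init__(self, hidden_size: int, init_values: float = 0.1):
        super().__init__()
        self.lambda_1 = nn.Parameter(init_values * torch.ones(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.lambda_1 * hidden_states

scale = LayerScale(hidden_size=8)
out = scale(torch.ones(2, 8))  # every channel scaled by 0.1
```

Because the scale is just an element-wise multiply, it can in principle be fused into an adjacent linear projection at export time, which is what the comment above is asking about.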
Member

The biggest diff from existing models seems to be here? Do we need to apply it in forward as a separate parameter, or could it be fused with the preceding projection layers?

Contributor Author

For the vision layer, the part that differs from InternVLVisionLayer is the drop_path layers. I have updated the definition in the modular file so that this class inherits from its InternVL counterpart, and removed the other redundant definitions to make use of the existing model. As for the two layer-scale terms you commented on, I think they are identical to what we already have in the existing InternVL model definition.

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/processing_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/image_processing_pil_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/configuration_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/configuration_qianfan_ocr.py
Comment thread utils/check_repo.py Outdated
@marvinzh
Contributor Author

marvinzh commented Apr 9, 2026

hi @zucchini-nlp, thanks for taking the time to review this PR, and sorry for the previously broken PR that went out before I had reviewed it locally by running the CI checks. I have updated the PR according to your comments. Specifically, I:

  • moved the config/processor into the modular file to make the best use of the existing implementation
  • refactored the modular file to reuse existing modules
  • fixed all the CI errors and tested locally before sending this out; some checks are still pending, and I will keep an eye on them

Please let me know if there is anything I can do to make it better, thanks!

@marvinzh marvinzh requested a review from zucchini-nlp April 10, 2026 06:02
Member

@zucchini-nlp zucchini-nlp left a comment

Great work, much much cleaner! I want to push a bit more on using modular, because a few modules look identical to me. Left comments below

A core maintainer will pass by next week for final review :)

Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +236 to +239
base_h = self.image_size[0] // patch_size[0]
base_w = self.image_size[1] // patch_size[1]
new_h = height // patch_size[0]
new_w = width // patch_size[1]
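For reference, this snippet computes the base patch grid (from the pretraining image_size) and the target grid for the incoming image, typically to decide whether position embeddings must be interpolated. A minimal sketch with illustrative sizes (448-pixel images with 14-pixel patches are common InternVL-style values, not necessarily Qianfan-OCR's):

```python
def patch_grid(image_size: tuple, patch_size: tuple) -> tuple:
    """Patches along (height, width); used to detect when position
    embeddings need interpolation for a new input resolution."""
    return image_size[0] // patch_size[0], image_size[1] // patch_size[1]

base = patch_grid((448, 448), (14, 14))   # grid seen during pretraining
new = patch_grid((896, 448), (14, 14))    # a taller input image
needs_interpolation = base != new
```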
Member

tbh this looks very much the same as InternVL, or does Qianfan have a non-square image_size? In any case, can you add the major diff as a tiny comment?

Contributor Author

The initial idea was to keep this for future compatibility; however, our currently released model only uses square image sizes. Let me update the implementation, and we can add this back when we release non-square patches in the future.

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +384 to +387
try:
    target_dtype = next(self.vision_tower.parameters()).dtype
except StopIteration:
    target_dtype = pixel_values.dtype
Member

This shouldn't be needed, because the VisionModel casts it internally:

self.projection(pixel_values.to(self.projection.weight.dtype))

And if the rest is same, we can delete and let modular copy

Contributor Author

Hi, this is actually for DataParallel compatibility. As stated in the previous comment, the current implementation in InternVL triggers a bug in multi-GPU environments (reproducible on 2x 4090): self.dtype iterates over an empty parameter list and raises a StopIteration exception. I did some research and found that DataParallel is now deprecated, so let's reuse InternVL and mark this unit test as skipped.
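The failure mode described here can be captured in a tiny, framework-free sketch: on an nn.DataParallel replica the parameters() iterator can be empty, so an unguarded next() raises StopIteration. The helper and FakeParam class below are illustrative stand-ins, not code from the PR:

```python
class FakeParam:
    """Stand-in for torch.nn.Parameter; only the dtype attribute matters here."""
    def __init__(self, dtype: str):
        self.dtype = dtype

def first_param_dtype(parameters, fallback_dtype: str) -> str:
    """dtype of the first parameter, or the fallback when the iterator is
    empty (as can happen on nn.DataParallel replicas)."""
    try:
        return next(iter(parameters)).dtype
    except StopIteration:
        return fallback_dtype

replica_dtype = first_param_dtype([], "float32")  # empty replica falls back
```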

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/conversion_mapping.py Outdated
@marvinzh
Contributor Author

looks like the failing CI case is due to AI-Sweden-Models/gpt-sw3-126m being removed from the Hub

@zucchini-nlp
Member

Rebasing will help, we fixed it yesterday :)

Also requesting a review from @vasqu, since I suppose the PR is mostly modularized by now; I might pass by later

@zucchini-nlp zucchini-nlp requested a review from vasqu April 14, 2026 09:54
@marvinzh
Contributor Author

looks like CI is blocked by an issue in test_modeling_glm.py, will rebase again tomorrow to see if it gets resolved

@vasqu
Contributor

vasqu commented Apr 14, 2026

Taking a look in a bit, dw about the CI - looks like a flaky test / something we need to fix on our side

Contributor

@vasqu vasqu left a comment

Already looks super good imo, just a lot of details we could further incorporate

One bigger point might be to use the VLM tester, wdyt @zucchini-nlp?

Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_processing_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_processing_qianfan_ocr.py Outdated
@marvinzh
Contributor Author

Thanks for the constructive comments. I have updated the code accordingly; please let me know if there is anything that should be fixed further

Contributor

@vasqu vasqu left a comment

Some last details. One big thing to change, imo: refactor the tests with our VLMTester instead of the current manual version. Other than that, nothing too big imo

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
        return hidden_states


class QianfanOCRVisionEncoder(nn.Module):
Contributor

Reopening: we should not have this module at all; it should live directly within QianfanOCRVisionModel. You will need to:

  1. update the conversion mapping to include a rename WeightRenaming(r"encoder.layers", r"layers")
  2. move these layers to the parent module

Contributor Author

@marvinzh marvinzh Apr 16, 2026

Oh, sorry, I thought the previous comment wasn't directed at me, so I didn't pay attention to it. Let's refactor to eliminate this unnecessary class

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
from PIL import Image


class QianfanOCRVisionText2TextModelTester:
Contributor

Let's refactor the tests here @marvinzh, e.g.

class Qwen3VLVisionText2TextModelTester(VLMModelTester):
    base_model_class = Qwen3VLModel
    config_class = Qwen3VLConfig
    text_config_class = Qwen3VLTextConfig
    vision_config_class = Qwen3VLVisionConfig
    conditional_generation_class = Qwen3VLForConditionalGeneration

    def __init__(self, parent, **kwargs):
        kwargs.setdefault("image_token_id", 3)
        kwargs.setdefault("video_token_id", 4)
        kwargs.setdefault("vision_start_token_id", 5)
        kwargs.setdefault("vision_end_token_id", 6)
        kwargs.setdefault("image_size", 16)
        kwargs.setdefault("patch_size", 16)
        kwargs.setdefault("num_image_tokens", 32)
        kwargs.setdefault("hidden_act", "silu")
        kwargs.setdefault("num_attention_heads", 4)
        kwargs.setdefault("num_key_value_heads", 2)
        kwargs.setdefault("head_dim", 8)
        kwargs.setdefault("depth", 2)
        kwargs.setdefault("vision_hidden_act", "gelu_pytorch_tanh")
        kwargs.setdefault("num_heads", 4)
        kwargs.setdefault("spatial_merge_size", 1)
        kwargs.setdefault("temporal_patch_size", 2)
        kwargs.setdefault("num_position_embeddings", 16)
        kwargs.setdefault("deepstack_visual_indexes", [0, 1])
        kwargs.setdefault(
            "rope_parameters",
            {
                "rope_type": "default",
                "mrope_section": [16, 8, 8],
                "mrope_interleaved": True,
                "rope_theta": 10000,
            },
        )
        super().__init__(parent, **kwargs)
        # These can be inferred from existing properties and don't get separate kwargs
        self.out_hidden_size = self.hidden_size
        self.vision_hidden_size = self.hidden_size
        self.vision_intermediate_size = self.hidden_size

    def create_pixel_values(self):
        # Qwen3VL expects flattened patches: (total_patches, channels * patch_size^2 * temporal_patch_size)
        return floats_tensor(
            [
                self.batch_size * (self.image_size**2) // (self.patch_size**2),
                self.num_channels * (self.patch_size**2) * self.temporal_patch_size,
            ]
        )

    def place_image_tokens(self, input_ids, config):
        # Place image tokens with vision_start_token_id prefix
        input_ids = input_ids.clone()
        # Clear any accidental special tokens first
        input_ids[:, -1] = self.pad_token_id
        input_ids[input_ids == self.video_token_id] = self.pad_token_id
        input_ids[input_ids == self.image_token_id] = self.pad_token_id
        input_ids[input_ids == self.vision_start_token_id] = self.pad_token_id
        # Place image tokens with vision_start_token_id prefix
        input_ids[:, 1] = self.image_token_id
        input_ids[:, 0] = self.vision_start_token_id
        return input_ids

    def get_additional_inputs(self, config, input_ids, pixel_values):
        mm_token_type_ids = torch.zeros_like(input_ids)
        mm_token_type_ids[input_ids == self.image_token_id] = 1
        return {
            "image_grid_thw": torch.tensor([[1, 1, 1]] * self.batch_size, device=torch_device),
            "mm_token_type_ids": mm_token_type_ids,
        }

    def get_config(self):
        # Qwen3VLConfig expects text_config and vision_config as dicts, not config objects
        return self.config_class(
            text_config=self.get_text_config().to_dict(),
            vision_config=self.get_vision_config().to_dict(),
            image_token_id=self.image_token_id,
            video_token_id=self.video_token_id,
            vision_start_token_id=self.vision_start_token_id,
            vision_end_token_id=self.vision_end_token_id,
            tie_word_embeddings=self.tie_word_embeddings,
            pad_token_id=self.pad_token_id,
        )


@require_torch
class Qwen3VLModelTest(VLMModelTest, unittest.TestCase):
    model_tester_class = Qwen3VLVisionText2TextModelTester

This should avoid a lot of manual work

@marvinzh
Contributor Author

Refactored the module to remove the unnecessary class and refactored the tests to use VLMTester. Looks like the torch.compile issue is fine on the CI end.

@marvinzh marvinzh requested a review from vasqu April 16, 2026 12:01
@vasqu
Contributor

vasqu commented Apr 16, 2026

Will take a look in a bit!

Contributor

@vasqu vasqu left a comment

Fixed some small last details myself 🤗 will check with our CI (run-slow) in a second

Parts of the CI seem unstable, so I would likely merge tomorrow (if run-slow passes)

Comment thread docs/source/en/model_doc/qianfan_ocr.md
from ...utils.output_capturing import capture_outputs
from ..auto import CONFIG_MAPPING, AutoConfig
from ..beit.modeling_beit import BeitDropPath
from ..internvl.configuration_internvl import InternVLConfig, InternVLVisionConfig
Contributor

There was a modular bug 92fa1c3

I don't think we need to change anything, but it would still be nice if you could cross-check

@vasqu
Contributor

vasqu commented Apr 16, 2026

run-slow: qianfan_ocr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qianfan_ocr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 482897d3 workflow commit (merge commit)
PR cfd2a9cc branch commit (from PR)
main 947eff6e base commit (on main)

Model CI Report

3 new failed tests from this PR 😭

  • qianfan_ocr:
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_batched_generate (✅ ⟹ ❌)
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_forward (✅ ⟹ ❌)
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_generate (✅ ⟹ ❌)

@vasqu
Contributor

vasqu commented Apr 16, 2026

Ok looking at https://github.com/huggingface/transformers/actions/runs/24520690200 (the workflow run from run-slow), it seems that the integration tests fail

It could very likely be a GPU difference (we use A10 GPUs), so I can adjust the values accordingly if the model still works as expected (and I didn't break anything). Just let me know @marvinzh

Side note: CI is unstable so dw about those red CIs 😢

@marvinzh
Contributor Author

marvinzh commented Apr 17, 2026

Hi @vasqu thanks for the comments and approval!

Currently we calibrated the expected outputs on a 4090 (cu127). As we do not have access to A10 GPUs, please help adjust the outputs for your environment, thanks

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, qianfan_ocr

@vasqu
Contributor

vasqu commented Apr 17, 2026

run-slow: qianfan_ocr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qianfan_ocr"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN a33a5c91 workflow commit (merge commit)
PR 2fbd0d7a branch commit (from PR)
main ff4f96a7 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45280&sha=2fbd0d

@vasqu vasqu merged commit 77de8dd into huggingface:main Apr 17, 2026
27 of 29 checks passed
@vasqu
Contributor

vasqu commented Apr 17, 2026

Thanks for all the iterations, model has now been merged 🤗 @marvinzh
