
add Qianfan-OCR model definition#45280

Merged
vasqu merged 40 commits into huggingface:main from marvinzh:add-qianfan-ocr
Apr 17, 2026

Conversation

@marvinzh
Contributor

@marvinzh marvinzh commented Apr 7, 2026

What does this PR do?

add Qianfan-OCR model definition

  • QianfanOCRForConditionalGeneration - the image-text-to-text model definition
  • QianfanOCRModel - the backbone of the image-text-to-text model, without the LM head
  • QianfanOCRProcessor - the text and image preprocessor
  • QianfanOCRVisionModel - the vision transformer used in the Qianfan-OCR model
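For intuition, a processor like QianfanOCRProcessor typically expands a single image placeholder in the prompt into as many image tokens as the vision tower emits features, so text and vision embeddings can be merged position-wise. A minimal sketch of that expansion (the placeholder string and token count are illustrative, not Qianfan-OCR's actual values):

```python
def expand_image_tokens(prompt: str, num_image_tokens: int, placeholder: str = "<image>") -> str:
    """Replace each image placeholder with num_image_tokens copies so that
    token positions line up with the projected vision features."""
    return prompt.replace(placeholder, placeholder * num_image_tokens)

# Illustrative only: a real vision tower emits hundreds of features per image.
expanded = expand_image_tokens("OCR this page: <image>", num_image_tokens=4)
```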

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

Multimodal LLM checklist

  • Modular file: modular_<model_name>.py implemented and verified with python utils/modular_model_converter.py <model_name>
  • Image processors: Torchvision backend (<Model>ImageProcessor from TorchvisionBackend) and PIL backend (<Model>ImageProcessorPil from PilBackend) both implemented (see IMAGE_PROCESSOR_REFACTORING_GUIDE.md)
  • Conversion script: convert_<model_name>_to_hf.py added with usage examples
  • Integration tests: End-to-end tests with exact output matching (text or logits)
  • Documentation: Model docs added/updated in docs/source/en/model_doc/
  • Pattern reuse: Verified against similar models (LLaVA, Idefics2, etc.)
  • Quality checks: make style passes with no errors

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp @vasqu

Member

@zucchini-nlp zucchini-nlp left a comment

Hey! The PR is probably not yet ready, so I'm just leaving a very early review. I quickly skimmed over the model and left comments about where we can copy from for some modules. It looks like all the files can be put entirely in modular; there is a lot of copying going on in the config and processor as well

Comment on lines +58 to +63
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    # Top-level prefix: vision_model.* → model.vision_tower.*
    r"^vision_model\.": r"model.vision_tower.",
    # Encoder layer list: encoder.layers.N → encoder.layer.N
    r"encoder\.layers\.": r"encoder.layer.",
    # NOTE: class_embedding, patch_embedding, position_embedding keep their
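For reference, a mapping like this is typically applied by running `re.sub` with each pattern over every checkpoint key; a minimal, self-contained sketch (the key names are illustrative):

```python
import re

# Same shape as the PR's mapping: regex pattern -> replacement string.
ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION = {
    r"^vision_model\.": r"model.vision_tower.",
    r"encoder\.layers\.": r"encoder.layer.",
}

def convert_key(key: str) -> str:
    # Apply each pattern in order; later patterns see earlier renames.
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING_VISION.items():
        key = re.sub(pattern, replacement, key)
    return key

converted = convert_key("vision_model.encoder.layers.0.attn.qkv.weight")
```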
Member

Could we perhaps do this with conversion_mapping and apply the rest of the changes to the config/tokenizer directly on the Hub repo?

Contributor Author

@marvinzh marvinzh Apr 9, 2026

Thanks for the advice. Yes, I found conversion_mapping a very helpful tool for converting names between safetensors that use different naming schemas. Also, I feel there may be some outdated information in https://huggingface.co/docs/transformers/main/en/contributing that misled me, and probably misleads others contributing a new VLM model for the first time as well.

Would you mind if I raise another PR to update the documentation too, specifically the VLM contribution checklist, which is quite different from the contribution process now?

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +79 to +80
self.lambda_1 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
self.lambda_2 = nn.Parameter(init_values * torch.ones(config.hidden_size), requires_grad=True)
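For context, lambda_1 and lambda_2 are InternVL/BEiT-style layer-scale parameters: learnable per-channel multipliers applied to the attention and MLP residual branches. A minimal sketch of the pattern (the module name and init value below are illustrative, not taken from the PR):

```python
import torch
from torch import nn

class LayerScale(nn.Module):
    """Learnable per-channel scale for a residual branch; a small init value
    keeps the branch close to identity early in training."""
    def __init__(self, hidden_size: int, init_values: float = 0.1):
        super().__init__()
        self.lambda_1 = nn.Parameter(init_values * torch.ones(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.lambda_1 * hidden_states

scale = LayerScale(hidden_size=8)
out = scale(torch.ones(2, 8))  # every channel scaled by 0.1
```

Because the scale is just an element-wise multiply, it can in principle be fused into an adjacent linear projection at export time, which is what the comment above is asking about.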
Member

The biggest diff from existing models seems to be here? Do we need to apply it in forward as a separate parameter, or could it be fused with the preceding projection layers?

Contributor Author

For the vision layer, the part that differs from InternVLVisionLayer is the drop_path layers. I have updated the definition in the modular file so that this class inherits from its InternVL counterpart, and removed the other redundant definitions to make use of the existing model. As for the two layer-scale terms you commented on, I think they are identical to what we already have in the existing InternVL model definition.

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/processing_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/image_processing_pil_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/configuration_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/configuration_qianfan_ocr.py
Comment thread utils/check_repo.py Outdated
@marvinzh
Contributor Author

marvinzh commented Apr 9, 2026

hi @zucchini-nlp, thanks for taking the time to review this PR, and sorry for the previously broken PR that went out before I had reviewed it locally by running the CI checks. I have updated the PR according to your comments. Specifically, I:

  • moved the config/processor into the modular file to make the best use of the existing implementation
  • refactored the modular file to reuse existing modules
  • fixed all the CI errors and tested locally before sending this out; some checks are still pending, and I will keep an eye on them

Please let me know if there is anything I can do to make it better, thanks!

@marvinzh marvinzh requested a review from zucchini-nlp April 10, 2026 06:02
Member

@zucchini-nlp zucchini-nlp left a comment

Great work, much much cleaner! I want to push a bit more on using modular, because a few modules look identical to me. Left comments below

A core maintainer will pass by next week for final review :)

Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +236 to +239
base_h = self.image_size[0] // patch_size[0]
base_w = self.image_size[1] // patch_size[1]
new_h = height // patch_size[0]
new_w = width // patch_size[1]
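For reference, this snippet computes the base patch grid (from the pretraining image_size) and the target grid for the incoming image, typically to decide whether position embeddings must be interpolated. A minimal sketch with illustrative sizes (448-pixel images with 14-pixel patches are common InternVL-style values, not necessarily Qianfan-OCR's):

```python
def patch_grid(image_size: tuple, patch_size: tuple) -> tuple:
    """Patches along (height, width); used to detect when position
    embeddings need interpolation for a new input resolution."""
    return image_size[0] // patch_size[0], image_size[1] // patch_size[1]

base = patch_grid((448, 448), (14, 14))   # grid seen during pretraining
new = patch_grid((896, 448), (14, 14))    # a taller input image
needs_interpolation = base != new
```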
Member

tbh this looks very much the same as InternVL, or does Qianfan have a non-square image_size? In any case, can you add the major diff as a tiny comment?

Contributor Author

The initial idea was to keep this for future compatibility; however, our currently released model only uses square image sizes. Let me update the implementation, and we can add this back when we release non-square patches in the future.

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment on lines +384 to +387
try:
    target_dtype = next(self.vision_tower.parameters()).dtype
except StopIteration:
    target_dtype = pixel_values.dtype
Member

This shouldn't be needed, because the VisionModel casts it internally:

self.projection(pixel_values.to(self.projection.weight.dtype))

And if the rest is same, we can delete and let modular copy

Contributor Author

Hi, this is actually for DataParallel compatibility. As stated in the previous comment, the current implementation in InternVL triggers a bug in multi-GPU environments (reproducible on 2x 4090): self.dtype iterates over an empty parameter list and raises a StopIteration exception. I did some research and found that DataParallel is now deprecated, so let's reuse InternVL and mark this unit test as skipped.
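The failure mode described here can be captured in a tiny, framework-free sketch: on an nn.DataParallel replica the parameters() iterator can be empty, so an unguarded next() raises StopIteration. The helper and FakeParam class below are illustrative stand-ins, not code from the PR:

```python
class FakeParam:
    """Stand-in for torch.nn.Parameter; only the dtype attribute matters here."""
    def __init__(self, dtype: str):
        self.dtype = dtype

def first_param_dtype(parameters, fallback_dtype: str) -> str:
    """dtype of the first parameter, or the fallback when the iterator is
    empty (as can happen on nn.DataParallel replicas)."""
    try:
        return next(iter(parameters)).dtype
    except StopIteration:
        return fallback_dtype

replica_dtype = first_param_dtype([], "float32")  # empty replica falls back
```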

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/conversion_mapping.py Outdated
@marvinzh
Contributor Author

looks like the failing CI case is due to AI-Sweden-Models/gpt-sw3-126m being removed from the Hub

@zucchini-nlp
Member

Rebasing will help, we fixed it yesterday :)

Also requesting a review from @vasqu, since I suppose the PR is mostly modularized by now; I might pass by later

@zucchini-nlp zucchini-nlp requested a review from vasqu April 14, 2026 09:54
@marvinzh
Contributor Author

looks like CI is blocked by an issue in test_modeling_glm.py, will rebase again tomorrow to see if it gets resolved

@vasqu
Contributor

vasqu commented Apr 14, 2026

Taking a look in a bit, dw about the CI - looks like a flaky test / something we need to fix on our side

Contributor

@vasqu vasqu left a comment

Already looks super good imo, just a lot of details we could further incorporate

One bigger point might be to use the VLM tester, wdyt @zucchini-nlp?

Comment thread docs/source/en/model_doc/qianfan_ocr.md Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_processing_qianfan_ocr.py Outdated
Comment thread tests/models/qianfan_ocr/test_processing_qianfan_ocr.py Outdated
@marvinzh
Contributor Author

Thanks for the constructive comments. I have updated the code accordingly; please let me know if there is anything that should be fixed further

Contributor

@vasqu vasqu left a comment

Some last details. One big thing to change, imo: refactor the tests with our VLMTester instead of the current manual version. Other than that, nothing too big imo

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
        return hidden_states


class QianfanOCRVisionEncoder(nn.Module):
Contributor

Reopening: we should not have this module at all; it should live directly within QianfanOCRVisionModel. You will need to:

  1. update the conversion mapping to include a rename WeightRenaming(r"encoder.layers", r"layers")
  2. move these layers to the parent module

Contributor Author

@marvinzh marvinzh Apr 16, 2026

Oh, sorry, I thought the previous comment wasn't directed at me, so I didn't pay attention to it. Let's refactor to eliminate this unnecessary class

Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
Comment thread src/transformers/models/qianfan_ocr/modular_qianfan_ocr.py Outdated
from PIL import Image


class QianfanOCRVisionText2TextModelTester:
Contributor

Let's refactor the tests here @marvinzh, e.g.

class Qwen3VLVisionText2TextModelTester(VLMModelTester):
    base_model_class = Qwen3VLModel
    config_class = Qwen3VLConfig
    text_config_class = Qwen3VLTextConfig
    vision_config_class = Qwen3VLVisionConfig
    conditional_generation_class = Qwen3VLForConditionalGeneration

    def __init__(self, parent, **kwargs):
        kwargs.setdefault("image_token_id", 3)
        kwargs.setdefault("video_token_id", 4)
        kwargs.setdefault("vision_start_token_id", 5)
        kwargs.setdefault("vision_end_token_id", 6)
        kwargs.setdefault("image_size", 16)
        kwargs.setdefault("patch_size", 16)
        kwargs.setdefault("num_image_tokens", 32)
        kwargs.setdefault("hidden_act", "silu")
        kwargs.setdefault("num_attention_heads", 4)
        kwargs.setdefault("num_key_value_heads", 2)
        kwargs.setdefault("head_dim", 8)
        kwargs.setdefault("depth", 2)
        kwargs.setdefault("vision_hidden_act", "gelu_pytorch_tanh")
        kwargs.setdefault("num_heads", 4)
        kwargs.setdefault("spatial_merge_size", 1)
        kwargs.setdefault("temporal_patch_size", 2)
        kwargs.setdefault("num_position_embeddings", 16)
        kwargs.setdefault("deepstack_visual_indexes", [0, 1])
        kwargs.setdefault(
            "rope_parameters",
            {
                "rope_type": "default",
                "mrope_section": [16, 8, 8],
                "mrope_interleaved": True,
                "rope_theta": 10000,
            },
        )
        super().__init__(parent, **kwargs)
        # These can be inferred from existing properties and don't get separate kwargs
        self.out_hidden_size = self.hidden_size
        self.vision_hidden_size = self.hidden_size
        self.vision_intermediate_size = self.hidden_size

    def create_pixel_values(self):
        # Qwen3VL expects flattened patches: (total_patches, channels * patch_size^2 * temporal_patch_size)
        return floats_tensor(
            [
                self.batch_size * (self.image_size**2) // (self.patch_size**2),
                self.num_channels * (self.patch_size**2) * self.temporal_patch_size,
            ]
        )

    def place_image_tokens(self, input_ids, config):
        # Place image tokens with vision_start_token_id prefix
        input_ids = input_ids.clone()
        # Clear any accidental special tokens first
        input_ids[:, -1] = self.pad_token_id
        input_ids[input_ids == self.video_token_id] = self.pad_token_id
        input_ids[input_ids == self.image_token_id] = self.pad_token_id
        input_ids[input_ids == self.vision_start_token_id] = self.pad_token_id
        # Place image tokens with vision_start_token_id prefix
        input_ids[:, 1] = self.image_token_id
        input_ids[:, 0] = self.vision_start_token_id
        return input_ids

    def get_additional_inputs(self, config, input_ids, pixel_values):
        mm_token_type_ids = torch.zeros_like(input_ids)
        mm_token_type_ids[input_ids == self.image_token_id] = 1
        return {
            "image_grid_thw": torch.tensor([[1, 1, 1]] * self.batch_size, device=torch_device),
            "mm_token_type_ids": mm_token_type_ids,
        }

    def get_config(self):
        # Qwen3VLConfig expects text_config and vision_config as dicts, not config objects
        return self.config_class(
            text_config=self.get_text_config().to_dict(),
            vision_config=self.get_vision_config().to_dict(),
            image_token_id=self.image_token_id,
            video_token_id=self.video_token_id,
            vision_start_token_id=self.vision_start_token_id,
            vision_end_token_id=self.vision_end_token_id,
            tie_word_embeddings=self.tie_word_embeddings,
            pad_token_id=self.pad_token_id,
        )


@require_torch
class Qwen3VLModelTest(VLMModelTest, unittest.TestCase):
    model_tester_class = Qwen3VLVisionText2TextModelTester

This should avoid a lot of manual work

@marvinzh
Contributor Author

Refactored the module to remove the unnecessary class and refactored the tests to use VLMTester. Looks like the torch.compile issue is fine on the CI end.

@marvinzh marvinzh requested a review from vasqu April 16, 2026 12:01
@vasqu
Contributor

vasqu commented Apr 16, 2026

Will take a look in a bit!

Contributor

@vasqu vasqu left a comment

Fixed some small last details myself 🤗 will check with our CI (run-slow) in a second

Parts of the CI seem unstable, so I would likely merge tomorrow (if run-slow passes)

Comment thread docs/source/en/model_doc/qianfan_ocr.md
from ...utils.output_capturing import capture_outputs
from ..auto import CONFIG_MAPPING, AutoConfig
from ..beit.modeling_beit import BeitDropPath
from ..internvl.configuration_internvl import InternVLConfig, InternVLVisionConfig
Contributor

There was a modular bug 92fa1c3

I don't think we need to change anything, but it would still be nice if you could cross-check

@vasqu
Contributor

vasqu commented Apr 16, 2026

run-slow: qianfan_ocr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qianfan_ocr"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 482897d3 workflow commit (merge commit)
PR cfd2a9cc branch commit (from PR)
main 947eff6e base commit (on main)

Model CI Report

3 new failed tests from this PR 😭

  • qianfan_ocr:
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_batched_generate (✅ ⟹ ❌)
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_forward (✅ ⟹ ❌)
    tests/models/qianfan_ocr/test_modeling_qianfan_ocr.py::QianfanOCRIntegrationTest::test_model_integration_generate (✅ ⟹ ❌)

@vasqu
Contributor

vasqu commented Apr 16, 2026

Ok looking at https://github.com/huggingface/transformers/actions/runs/24520690200 (the workflow run from run-slow), it seems that the integration tests fail

It could very likely be a GPU difference (we use A10 GPUs), so I can adjust the values accordingly if the model still works as expected (and I didn't break anything). Just let me know @marvinzh

Side note: CI is unstable so dw about those red CIs 😢

@marvinzh
Contributor Author

marvinzh commented Apr 17, 2026

Hi @vasqu thanks for the comments and approval!

Currently we calibrated the expected outputs on a 4090 (cu127). As we do not have access to A10 GPUs, please help adjust the outputs for your environment, thanks

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, qianfan_ocr

@vasqu
Contributor

vasqu commented Apr 17, 2026

run-slow: qianfan_ocr

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/qianfan_ocr"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN a33a5c91 workflow commit (merge commit)
PR 2fbd0d7a branch commit (from PR)
main ff4f96a7 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45280&sha=2fbd0d

@vasqu vasqu merged commit 77de8dd into huggingface:main Apr 17, 2026
27 of 29 checks passed
@vasqu
Contributor

vasqu commented Apr 17, 2026

Thanks for all the iterations, model has now been merged 🤗 @marvinzh
