[Model] Add PP-DocLayoutV2 Model Support #43018

Merged
vasqu merged 31 commits into huggingface:main from zhang-prog:feat/pp_doclayout_v2 on Feb 27, 2026

Conversation

@zhang-prog
Contributor

@zhang-prog zhang-prog commented Dec 23, 2025

What does this PR do?

This PR adds the PP-DocLayoutV2 model from PaddleOCR to Hugging Face Transformers.

Relevant Links:

PaddleOCR
https://huggingface.co/PaddlePaddle/PP-DocLayoutV2_safetensors

Usage

Use a pipeline

import requests
from PIL import Image
from transformers import pipeline

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/layout_demo.jpg", stream=True).raw)
layout_detector = pipeline("object-detection", model="PaddlePaddle/PP-DocLayoutV2_safetensors")
result = layout_detector(image)
print(result)

Load model directly

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_path = "PaddlePaddle/PP-DocLayoutV2_safetensors"
model = AutoModelForObjectDetection.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

image = Image.open(requests.get("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/layout_demo.jpg", stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

outputs = model(**inputs)
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]))
for result in results:
    for idx, (score, label_id, box) in enumerate(zip(result["scores"], result["labels"], result["boxes"])):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"Order {idx + 1}: {model.config.id2label[label]}: {score:.2f} {box}")

@zhang-prog zhang-prog changed the title from "init" to "[Model] Add PP-DocLayoutV2 Model Support" on Dec 23, 2025
@ArthurZucker
Collaborator

cc @molbap if you have time!

@molbap
Contributor

molbap commented Jan 6, 2026

Reviewing!

Contributor

@molbap molbap left a comment

Thanks for the addition! Left a few comments to start cleaning up/aligning with library standards. Let me know if you have any question and I'll re-review once addressed 🤗

logits = outputs.logits
order_logits = outputs.order_logits

order_seqs = get_order(order_logits)

Contributor

more explicit naming for get_order please

Contributor Author

Renamed.

Comment on lines +72 to +155
def _default_id2label() -> dict[int, str]:
    return {
        0: "abstract",
        1: "algorithm",
        2: "aside_text",
        3: "chart",
        4: "content",
        5: "formula",
        6: "doc_title",
        7: "figure_title",
        8: "footer",
        9: "footer",
        10: "footnote",
        11: "formula_number",
        12: "header",
        13: "header",
        14: "image",
        15: "formula",
        16: "number",
        17: "paragraph_title",
        18: "reference",
        19: "reference_content",
        20: "seal",
        21: "table",
        22: "text",
        23: "text",
        24: "vision_footnote",
    }


def _default_threshold_mapping() -> dict[str, float]:
    return {
        "abstract": 0.50,
        "algorithm": 0.50,
        "aside_text": 0.50,
        "chart": 0.50,
        "content": 0.50,
        "formula": 0.40,
        "doc_title": 0.40,
        "figure_title": 0.50,
        "footer": 0.50,
        "footnote": 0.50,
        "formula_number": 0.50,
        "header": 0.50,
        "image": 0.50,
        "number": 0.50,
        "paragraph_title": 0.40,
        "reference": 0.50,
        "reference_content": 0.50,
        "seal": 0.45,
        "table": 0.50,
        "text": 0.40,
        "vision_footnote": 0.50,
    }


def _default_order_map() -> dict[str, int]:
    return {
        "abstract": 4,
        "algorithm": 2,
        "aside_text": 14,
        "chart": 1,
        "content": 5,
        "display_formula": 7,
        "doc_title": 8,
        "figure_title": 6,
        "footer": 11,
        "footer_image": 11,
        "footnote": 9,
        "formula_number": 13,
        "header": 10,
        "header_image": 10,
        "image": 1,
        "inline_formula": 2,
        "number": 3,
        "paragraph_title": 0,
        "reference": 2,
        "reference_content": 2,
        "seal": 12,
        "table": 1,
        "text": 2,
        "vertical_text": 15,
        "vision_footnote": 6,
    }

Contributor

I think we can remove these helpers and simply use the configuration directly, augmenting it with threshold values for instance

Contributor Author

Removed.

`list[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
in the batch as predicted by the model.
"""
return postprocess(outputs=outputs, threshold=threshold, target_sizes=target_sizes)

Contributor

maybe would be clearer to have this postprocess method as a class method, no? or at least closer for readability

Contributor Author

right, done.

self.dense = nn.Linear(config.hidden_size, self.heads * 2 * self.head_size)

def forward(self, inputs, attn_mask_1d):
    B, N, _ = inputs.shape

Contributor

in general, let's avoid single-letter variables please!

Contributor Author

Renamed.

Comment on lines +878 to +882
if self.tril_mask:
    lower = torch.tril(torch.ones([N, N], dtype=torch.float32, device=logits.device))
    lower = lower.bool().unsqueeze(0).unsqueeze(0)
    logits = logits - lower.to(logits.dtype) * 1e4
    pair_mask = torch.logical_or(pair_mask.bool(), lower)

Contributor

if I understand correctly, tril_mask is always true, so the attribute can be removed and this branch too
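
If the attribute goes away, the masking could simply be applied unconditionally, roughly like this (a sketch, not the final PR code; `N` is the sequence length from the original snippet):

lower = torch.tril(torch.ones((N, N), dtype=torch.bool, device=logits.device))
lower = lower.unsqueeze(0).unsqueeze(0)  # broadcast over batch and heads
logits = logits - lower.to(logits.dtype) * 1e4  # large negative bias on masked positions
pair_mask = torch.logical_or(pair_mask.bool(), lower)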

def box_rel_encoding(src_boxes: torch.Tensor, tgt_boxes: torch.Tensor = None, eps: float = 1e-5):
    if tgt_boxes is None:
        tgt_boxes = src_boxes
    assert src_boxes.shape[-1] == 4 and tgt_boxes.shape[-1] == 4

Contributor

no asserts, in general
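
For illustration, the usual replacement is an explicit exception instead of an assert (a sketch; the error message text is just an example):

if src_boxes.shape[-1] != 4 or tgt_boxes.shape[-1] != 4:
    raise ValueError("Boxes must have 4 coordinates in their last dimension.")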

Contributor Author

Removed.



def get_sine_pos_embed(
    x: torch.Tensor, num_pos_feats: int, temperature: float = 10000.0, scale: float = 100.0, exchange_xy: bool = False

Contributor

exchange_xy is always False here. Also, for x, same comment about single-letter variables.

Contributor Author

Done.

Comment on lines +1270 to +1272
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,

Contributor

These three arguments are deprecated now. The first two need the decorator @check_model_inputs, and return_dict now uses @can_return_tuple. Then you'll need to specify in the model class attributes the modules that can record outputs, with something close to this, I suppose:

      _can_record_outputs = {
          "hidden_states": OutputRecorder(PPDocLayoutV2DecoderLayer, index=0),
          "attentions": [
              OutputRecorder(PPDocLayoutV2MultiheadAttention, index=1),
              OutputRecorder(PPDocLayoutV2MultiscaleDeformableAttention, index=1),
          ],
      }

Contributor Author

@zhang-prog zhang-prog Jan 7, 2026

Okay, I’ve made the change, but it’s causing a test failure.

I duplicated PPDocLayoutV2HybridEncoder from RTDetrHybridEncoder, but RTDetrHybridEncoder uses a deprecated method that leaves encoder_hidden_states and encoder_attentions as None. This breaks the test_attention_outputs and check_hidden_states_output tests.

Any suggestions for a fix?

[screenshots of the test failure and test logs omitted]

Contributor

ah indeed the base model uses a deprecated method as well. let me check and get back to you soon

Contributor Author

Any ideas on how we can fix this?

Contributor

Apart from modifying RT-DETR and updating it to standards, no unfortunately. doing another review today

Contributor Author

Got it, thanks.
So, is there a plan to implement this fix in the near future? I’m asking because this issue affects multiple models.
Alternatively, I’m happy to make the RT-DETR changes myself for now if that would be helpful.
Please let me know the best way to proceed.

Comment thread src/transformers/models/pp_doclayout_v2/modular_pp_doclayout_v2.py

# custom
if rel_2d_pos is not None:
    attention_scores += rel_2d_pos

Contributor

I see, this will not work with fa2/etc though. LayoutLMv3 doesn't use the recent attention_interface, so this will need to be revamped afterwards, I'd prefer to update it now and implement a proper eager_attention_forward

Contributor Author

Got it. So what’s my next step here? Is there a good reference model I can look at for this update?

Contributor

Bumping this, but yea something along bert

def eager_attention_forward(

FA and flex won't work, we could make SDPA work by integrating the rel bias into the mask directly. Check out t5 #42453 (at least at the point of my last commit :D)
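
For reference, a minimal sketch of a BERT-style eager attention forward that folds the relative 2D bias into the additive attention mask (so SDPA could later reuse the same mask). This is illustrative only, not the final PR implementation:

import torch
from torch import nn

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # attention_mask is assumed to already contain the additive rel_2d_pos bias
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output, attn_weights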

@zhang-prog
Contributor Author

@molbap
PTAL.
There are still two issues that need to be discussed. Please review my response.
Thanks for your efforts! 🤗

@github-actions
Contributor

github-actions Bot commented Jan 7, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43018&sha=eae5dd

@zhang-prog zhang-prog requested a review from molbap January 7, 2026 08:42
@zhang-prog
Contributor Author

@molbap
I have made some modifications according to the V3 review. Maybe it can be merged soon.🤗
PTAL.

Contributor

@molbap molbap left a comment

Thanks! I definitely think we should pull from #43098 or the other way around, would simplify this greatly!

Comment on lines +618 to +620
self.image_processor = (
    PPDocLayoutV2ImageProcessor.from_pretrained(model_path) if is_vision_available() else None
)

Contributor

vision is required for this test to run so can be simplified

return result


class LayoutLMv3TextEmbeddingsCustom(LayoutLMv3TextEmbeddings):

Contributor

For this, this should not be prefixed by LayoutLMv3, but should be for this specific model. All modules should share the common prefix of the current model, here PPDocLayoutV2. Same for ReadingOrder, it should rather be something like PPDocLayoutV2ReadingOrder


# Normalize the attention scores to probabilities.
# Use the trick of the CogView paper to stabilize training
attention_probs = self.cogview_attention(attention_scores)

Contributor

so you confirm you are using the cogview attention as well? else, we can drop it in the eager path perhaps and use the new attention interface?

Contributor

Bumping, same question - otherwise sdpa will become impossible for now I think
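
For context, the CogView trick in LayoutLMv3 is roughly the following PB-Relax-style rescaling before the softmax (a sketch based on the LayoutLMv3 implementation; `nn` is `torch.nn`):

def cogview_attention(self, attention_scores, alpha=32):
    # Rescale, subtract the per-row max, then scale back before the softmax to avoid overflow.
    scaled_attention_scores = attention_scores / alpha
    max_value = scaled_attention_scores.amax(dim=-1).unsqueeze(-1)
    new_attention_scores = (scaled_attention_scores - max_value) * alpha
    return nn.Softmax(dim=-1)(new_attention_scores)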

new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)

outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

Contributor

this has to be removed (we don't support output_attentions, output_hidden_states, return_dict, all are handled through decorators now)

return out


class LayoutLMv3SelfAttentionCustom(LayoutLMv3SelfAttention):

Contributor

needs to be renamed with proper prefixes PPDocLayoutV2SelfAttention (valid across the file, all model classes need to be prefixed properly)


qw_t = qw.transpose(1, 2)
kw_t = kw.transpose(1, 2)
logits = torch.einsum("bhmd,bhnd->bhmn", qw_t, kw_t) / (self.head_size**0.5)

Contributor

no einsum/einops unless unavoidable. It seems similar to v3 again, can we modularize more?
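
For reference, that einsum is equivalent to a plain batched matmul (sketch):

# "bhmd,bhnd->bhmn" is a batched matmul of qw_t with the transpose of kw_t
logits = torch.matmul(qw_t, kw_t.transpose(-1, -2)) / (self.head_size ** 0.5)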



def box_rel_encoding(src_boxes: torch.Tensor, tgt_boxes: torch.Tensor = None, eps: float = 1e-5):
    if tgt_boxes is None:

Contributor

target instead of tgt, etc

return out


class PositionRelationEmbedding(nn.Module):

Contributor

should be prefixed as well

Comment on lines +669 to +681
class LayoutLMv3SelfOutputCustom(LayoutLMv3SelfOutput):
    pass


class LayoutLMv3IntermediateCustom(LayoutLMv3Intermediate):
    pass


class LayoutLMv3OutputCustom(LayoutLMv3Output):
    pass


class LayoutLMv3AttentionCustom(LayoutLMv3Attention):

Contributor

should all be prefixed as well, and instead of custom, we can write e.g. class PPDocLayoutv2Attention(LayoutLMv3Attention)

encoder_output = encoder_output.last_hidden_state
tok = encoder_output[:, 1 : 1 + seq_len, :]
attn_1d = torch.arange(seq_len, device=device)[None, :] < num_pred[:, None]
logits_bh, _ = self.relative_head(tok, attn_1d)

Contributor

abbreviations to expand

@molbap
Contributor

molbap commented Feb 4, 2026

Hello @zhang-prog ! Let me know if you want some help for this PR 🤗 happy to re-review if needed

@zhang-prog
Contributor Author

@molbap Hi, Pablo! I’m currently refactoring the code based on the latest RT-DETR and PP-DocLayoutV3. I will be submitting a new commit this week. 🤗

@zhang-prog
Contributor Author

@molbap I submitted my updates, but I found that LayoutLMv3SelfAttention and LayoutLMv3Encoder still depend on passing output_attentions, output_hidden_states, and return_dict. Because we are reusing this code modularly and can’t make changes on our end (reminiscent of the RT-DETR case, -.-), do you have any suggestions on how to handle this?

@molbap
Contributor

molbap commented Feb 6, 2026

OK! I am indeed working on removing all of these old patterns here https://github.com/huggingface/transformers/pull/43590/changes#diff-418eaafaa5103cea9eb92c3b93c0b1d79aa420ea9c354764bd3e6d900657a9b5 but I'll make a smaller PR with just layoutlmv3 changes. apart from that, no other issues? re-reviewing then 🤗

Comment on lines +268 to +274
# backbone
backbone_config=None,
backbone=None,
use_pretrained_backbone=False,
use_timm_backbone=False,
freeze_backbone_batch_norms=True,
backbone_kwargs=None,

Member

let's get rid of args here except for the backbone_config. In the model config, we need to add the correct config type with model_type

No need to call consolidate_backbone_kwargs imo, we don't want users to keep passing extra backbone-related args in the future

Contributor Author

Done, args removed. It seems we still need to call consolidate_backbone_kwargs_to_config to instantiate the backbone.

Member

oh, you mean we need to make sure the backbone_config is indeed a config obj and not dict?

Contributor Author

yeah, it will raise an error if backbone_config is not a config obj.
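
For illustration, the usual pattern for turning a dict backbone_config into a config object looks roughly like this (a sketch; the default backbone model_type shown is an assumption, not necessarily what the PR uses):

from transformers.models.auto import CONFIG_MAPPING

if backbone_config is None:
    backbone_config = CONFIG_MAPPING["hgnet_v2"]()  # assumed default; use the model's actual backbone model_type
elif isinstance(backbone_config, dict):
    # rebuild the proper config class from the stored model_type
    backbone_config = CONFIG_MAPPING[backbone_config["model_type"]].from_dict(backbone_config)
self.backbone_config = backbone_config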

Comment on lines +833 to +834
@can_return_tuple
@check_model_inputs

Member

I don't think we should be stacking these two decorators together

Contributor Author

check_model_inputs removed.

Comment on lines +908 to +909
labels=labels,
)

Member

kwargs not passed here or used later, so we don't need any of output_xxx stuff here? In that case, prob we keep only can_return_tuple

Contributor Author

pass kwargs

Contributor

I think the main issue here is, do you envision a point where the model would need to track hidden states/attentions? IMO I think not, so passing kwargs might not be needed. However if you choose to pass them, would be better to type them as TransformersKwargs in the signature

Contributor

I'd rather keep kwargs and type it, we don't know how and whether it will be refactored. It certainly helps along modular inheritances either way
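
For illustration, the suggested typing would look roughly like this (the import locations are assumptions and may differ across transformers versions):

from typing_extensions import Unpack
from transformers.utils.generic import TransformersKwargs

def forward(self, pixel_values, labels=None, **kwargs: Unpack[TransformersKwargs]):
    ...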

@zhang-prog
Contributor Author

@molbap Pablo, thank you for your efforts! And I just wanted to ask: when can the LayoutLMv3 changes be merged? I'd like to get this PR merged as soon as possible, ideally before Feb 13. 🤗

@zhang-prog
Contributor Author

@vasqu fixed. please run slow tests again.🤗

@vasqu
Contributor

vasqu commented Feb 26, 2026

run-slow: pp_doclayout_v2

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pp_doclayout_v2"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 81c5f216 workflow commit (merge commit)
PR cdace81e branch commit (from PR)
main d4cb8416 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@vasqu vasqu enabled auto-merge (squash) February 26, 2026 13:19
@vasqu
Contributor

vasqu commented Feb 26, 2026

Building docs, then merging. Thanks a lot for iterating and sticking through 🤗 big work

@vasqu
Contributor

vasqu commented Feb 26, 2026

Oh, docs are not building. Will try to check later, if you have some time now @zhang-prog

@vasqu vasqu disabled auto-merge February 26, 2026 17:46
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor

vasqu commented Feb 26, 2026

Ok I finally found the issue and it is complicated so I'd rather not expand on it 😅

I have a commit here ece7fca which fixes the issue, but it still needs the blanks filled in; would be nice if you could do that, then I'd merge

@vasqu
Contributor

vasqu commented Feb 26, 2026

run-slow: pp_doclayout_v2

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/pp_doclayout_v2"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 864bb487 workflow commit (merge commit)
PR 48363ffe branch commit (from PR)
main b812aa91 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, pp_doclayout_v2


@zhang-prog
Contributor Author

@vasqu Thanks! I have filled out all the blanks, maybe the docs can be built successfully now?

@vasqu vasqu enabled auto-merge (squash) February 27, 2026 09:07
@vasqu
Contributor

vasqu commented Feb 27, 2026

Thanks a lot! Yea, should work now. The commit I added resolved it already, but the contents were TODOs :D With your fill-ins, we are good to go!

@vasqu vasqu merged commit 8e8b861 into huggingface:main Feb 27, 2026
25 checks passed
@zhang-prog
Contributor Author

@vasqu Cool! Thanks a lot for your help.🤗

zvik pushed a commit to zvik/transformers that referenced this pull request Mar 1, 2026
* init

* add model_doc

* fix

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* let's try this

* try

* fixup docs, expecting autodocstring to fail

* is it this

* fix

* update docstring

* update date

---------

Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>