[Model] Add PP-FormulaNet Model Support #45626
zhang-prog wants to merge 12 commits into huggingface:main
Conversation
vasqu
left a comment
Heya, first round 🤗 You weren't lying when you said it was more complicated :D I've made fewer comments to focus on the core first: restructure to a VLM and use the existing patterns with our normal generate pipeline.
| from PIL import Image
| from transformers import AutoProcessor, AutoModelForTextRecognition
|
| model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"

Suggested change:
- model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"
+ model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"
Not sure, but two checkpoints have been mentioned in the docs.
| s = news
| news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, noletter), r"\1\2", s)
| news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, letter), r"\1\2", news)
| news = re.sub(r"(%s)\s+?(%s)" % (letter, noletter), r"\1\2", news)
We should compile the regexes outside the loops; probably similar above.
| """
| text = self.remove_chinese_text_wrapping(text)
| try:
|     from ftfy import fix_text
Not a fan of an extra dependency tbh, but I guess it's too complicated/long to adopt here.
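If the dependency stays, a guarded import keeps it optional; a minimal sketch, with `_fix_text` being an illustrative helper name rather than the PR's actual code:

```python
# A minimal sketch of keeping ftfy a soft dependency; `_fix_text` is an
# assumed helper name, not necessarily what the PR uses.
def _fix_text(text: str) -> str:
    try:
        from ftfy import fix_text  # optional dependency, only used if installed
    except ImportError:
        return text  # graceful fallback when ftfy is not available
    return fix_text(text)
```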
| import torch
|
| class PPFormulaNetModelTester:
We can then use our VLM tester instead
transformers/tests/vlm_tester.py
Line 36 in 622b8e9
At this point, because we have a special encoder-decoder that does not fit the standard VLM style, I'm not sure it really fits; maybe a classic encoder-decoder approach might be better.
@vasqu I've restructured PPFormulaNet into a VLM. Some unit tests are still failing and I'm fixing them, but that shouldn't block you from reviewing the latest model structure code. PTAL.
vasqu
left a comment
Much better. I focused further on the model structure; I think the core is good now, and what's left is the details and how to make it fit within our style.
| @auto_docstring(checkpoint="PaddlePaddle/PPFormulaNet_plus-L_safetensors")
| @strict
| class PPFormulaNetTextConfig(PreTrainedConfig):

Suggested change:
- class PPFormulaNetTextConfig(PreTrainedConfig):
+ class PPFormulaNetTextConfig(MBartConfig):
We should inherit from MBart directly; that way we don't have to think too much about what is actually needed.
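With modular transformers, that could be as small as this sketch (assuming only the model_type and checkpoint defaults differ from MBart):

```python
# A sketch of the modular approach; inheriting from MBartConfig pulls in all of
# its attributes, so only genuine differences need to be spelled out here.
from transformers.models.mbart.configuration_mbart import MBartConfig


class PPFormulaNetTextConfig(MBartConfig):
    model_type = "pp_formulanet_text"  # assumed value, adjust to the actual registration
```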
| max_length (`int`, *optional*, defaults to 1537):
|     Controls the maximum length to use by one of the truncation/padding parameters.
You might be looking for max_position_embeddings instead; at the very least, this should not be part of the model but of the tokenizer. Probably left over from the old model pattern where you manually called generate.
| @auto_docstring(
|     checkpoint="PaddlePaddle/PPFormulaNet_plus-L_safetensors"
| )  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"

Suggested change:
- )  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"
+ )
tbh, I would mention it in the model docs (model_doc/pp_formulanet.md) but not here, because the default values are valid for that checkpoint; we only need one example here.
| decoder_outputs = self.language_model.decoder(
|     input_ids=decoder_input_ids,
|     attention_mask=decoder_attention_mask,
|     encoder_hidden_states=image_features,
|     encoder_attention_mask=attention_mask,
|     past_key_values=past_key_values,
|     inputs_embeds=decoder_inputs_embeds,
|     use_cache=use_cache,
|     **kwargs,
| )

Suggested change (only the call target changes):
- decoder_outputs = self.language_model.decoder(
+ decoder_outputs = self.language_model(
Like mentioned before, I would like to move away from the ForCausalLM model and use the decoder directly.
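A sketch of what I mean, using MBartDecoder as a stand-in for the text backbone; the class and attribute names here are assumptions, not the PR's actual code:

```python
import torch.nn as nn
from transformers.models.mbart.modeling_mbart import MBartDecoder


class SketchedConditionalGeneration(nn.Module):
    def __init__(self, config):
        super().__init__()
        # hold the bare decoder so forward() can call self.language_model(...) directly
        self.language_model = MBartDecoder(config.text_config)
        # the LM head lives on the wrapper instead of inside a ForCausalLM submodel
        self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)
```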
| def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
|     return GenerationMixin._prepare_encoder_decoder_kwargs_for_generation(*args, **kwargs)

Suggested change:
- def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
-     return GenerationMixin._prepare_encoder_decoder_kwargs_for_generation(*args, **kwargs)
+ def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
+     raise AttributeError()
I think you just don't want to inherit here? Raising an AttributeError tells modular not to.
| def get_encoder(self):
|     return self.model.vision_tower

Suggested change:
- def get_encoder(self):
-     return self.model.vision_tower
| encoder_last_hidden_state=encoder_outputs.last_hidden_state,
| encoder_hidden_states=encoder_outputs.hidden_states,
| encoder_attentions=encoder_outputs.attentions,
| image_hidden_states=image_features if pixel_values is not None else None,
Wouldn't this fit image_last_hidden_state better? You want the last (pooled) feature, not the set of hidden states across all of this.
Imo, we can even leave this out completely, as the encoder is everything image-related. The output class should be new and explain that the encoder == vision encoder, hence the different expected shapes and all; see the sketch below.
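Something like this, as a sketch; the class name and field set are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput


@dataclass
class PPFormulaNetModelOutput(ModelOutput):
    """Sketched output class where the encoder is the *vision* encoder, so the
    encoder fields hold image-patch features rather than text states."""

    last_hidden_state: Optional[torch.FloatTensor] = None
    past_key_values: Optional[tuple] = None
    encoder_last_hidden_state: Optional[torch.FloatTensor] = None  # (batch, num_patches, hidden)
    encoder_hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    encoder_attentions: Optional[tuple[torch.FloatTensor, ...]] = None
```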
| input_ids: torch.LongTensor | None = None,
| pixel_values: torch.FloatTensor | None = None,
| attention_mask: torch.Tensor | None = None,
| decoder_input_ids: torch.LongTensor | None = None,
| decoder_attention_mask: torch.LongTensor | None = None,
| decoder_inputs_embeds: torch.FloatTensor | None = None,
| encoder_outputs: list[torch.FloatTensor] | None = None,
| past_key_values: Cache | None = None,
| inputs_embeds: torch.FloatTensor | None = None,
| use_cache: bool | None = None,
| **kwargs,

Suggested change (unchanged parameters omitted):
- input_ids: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
+ attention_mask: torch.Tensor | None = None,  # TODO check if this is really used, likely to be removed as well
- inputs_embeds: torch.FloatTensor | None = None,
Noticing that we don't need those; we have pure images and no associated text in the encoder, so we can remove them.
The main input name should be pixel_values; not sure if that is already the case within the pretrained model :D
The parameters still need to remain in the argument list; otherwise, calling self.language_model(..., **kwargs) raises errors like:

    got multiple values for keyword argument 'attention_mask'
    got multiple values for keyword argument 'input_ids'
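For reference, a minimal repro of that clash with illustrative names:

```python
# A keyword passed both explicitly and inside **kwargs triggers the error above.
def language_model(input_ids=None, attention_mask=None, **kwargs):
    pass


kwargs = {"attention_mask": "mask forwarded from the caller"}
language_model(attention_mask=None, **kwargs)
# TypeError: language_model() got multiple values for keyword argument 'attention_mask'
```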
| - `'tf'`: Return TensorFlow `tf.constant` objects.
| - `'pt'`: Return PyTorch `torch.Tensor` objects.
| - `'np'`: Return NumPy `np.ndarray` objects.
| - `'jax'`: Return JAX `jnp.ndarray` objects.
| def get_encoder(self):
|     return self.vision_tower
Same deletion here; get_encoder accepts a modality arg and is defined in the parent.
vasqu
left a comment
Thanks a lot, already looking good! Left a few comments on some less critical parts, but it would still be nice to fix/change them 🤗
| import httpx
| from PIL import Image
| from transformers import AutoProcessor, PPFormulaNetForConditionalGeneration
| image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
| return BatchFeature({**image_inputs})
|
| def normalize(self, s: str) -> str:
nit: let's avoid the single-letter name s and just use text or similar
| rule_noletter_noletter = re.compile(r"(?!\\ )(%s)\s+?(%s)" % (noletter, noletter))
| rule_noletter_letter = re.compile(r"(?!\\ )(%s)\s+?(%s)" % (noletter, letter))
| rule_letter_noletter = re.compile(r"(%s)\s+?(%s)" % (letter, noletter))
On second thought, would it make sense to be more extreme and compile those regexes just once at init time? Same below. Roughly like the sketch that follows.
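A sketch, assuming the character classes from the snippet above; only a single pass is shown where the real code loops until the string stabilizes:

```python
import re


class LatexNormalizer:
    def __init__(self):
        letter = "[a-zA-Z]"      # assumed character classes, following the
        noletter = r"[\W_^\d]"   # %-substituted patterns quoted above
        # compile once at construction instead of on every normalize() call
        self.rule_noletter_noletter = re.compile(rf"(?!\\ )({noletter})\s+?({noletter})")
        self.rule_noletter_letter = re.compile(rf"(?!\\ )({noletter})\s+?({letter})")
        self.rule_letter_noletter = re.compile(rf"({letter})\s+?({noletter})")

    def normalize(self, text: str) -> str:
        # single pass shown; the actual code re-applies the rules to a fixpoint
        text = self.rule_noletter_noletter.sub(r"\1\2", text)
        text = self.rule_noletter_letter.sub(r"\1\2", text)
        return self.rule_letter_noletter.sub(r"\1\2", text)
```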
| input_ids: torch.LongTensor | None = None,
| attention_mask: torch.Tensor | None = None,
Can we mention with a small comment that we only keep these in the signature for generate compatibility?
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
| elif encoder_outputs.pooler_output is None:
|     encoder_outputs.pooler_output = self.multi_modal_projector(encoder_outputs.last_hidden_state)
Imo, we shouldn't need this. Maybe we should either
- move the projector into the encoder as well, or
- adjust the generation pipeline where we prepare the encoder outputs so that it calls get_image_features instead.

A sketch of the first option follows below.
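A sketch of option 1; the module names mirror the snippet above, but the wrapper class name is assumed:

```python
import torch.nn as nn


class PPFormulaNetVisionEncoder(nn.Module):
    """Sketch: fold the projector into the encoder so that encoder_outputs always
    carries projected features and the elif branch above disappears."""

    def __init__(self, vision_tower, multi_modal_projector):
        super().__init__()
        self.vision_tower = vision_tower
        self.multi_modal_projector = multi_modal_projector

    def forward(self, pixel_values, **kwargs):
        outputs = self.vision_tower(pixel_values, **kwargs)
        # always attach the projected features before returning
        outputs.pooler_output = self.multi_modal_projector(outputs.last_hidden_state)
        return outputs
```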
| # test_torch_exportable = False
| # model_split_percents = [0.5, 0.9]

Suggested change:
- # test_torch_exportable = False
- # model_split_percents = [0.5, 0.9]
| @unittest.skip(reason="PPFormulaNet is not small")
| def test_model_is_small(self):
|     pass
| @pytest.mark.generate
| @unittest.skip(reason="PPFormulaNet does not support beam search.")
| def test_beam_sample_generate(self):
|     pass
Would be nice to fix but also not that big of a deal
Done, the beam search tests all pass now.
| @unittest.skip(
|     reason="GenerationMixin._expand_inputs_for_generation() got multiple values for keyword argument 'input_ids'"
| )
| def test_generate_continue_from_past_key_values(self):
Hmm, this should be fixed imo if possible, maybe by overriding the test or something else.
Looks like the rtol/atol is maybe too low, but yeah, no worries, we can keep it skipped; not a high prio imo.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, pp_formulanet
vasqu
left a comment
Carefully approving because it's only small stuff now 🤗 I will check in with run-slow in a sec as well, just as a sanity check.
| def __init__(self, config):
|     super().__init__(config)
|
|     config.vision_config.decoder_hidden_size = config.text_config.hidden_size
This shouldn't be necessary; I'd rather adjust the values in the config from the get-go, e.g. validate instead of overwrite, as in the sketch below.
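E.g. a sketch that validates instead of overwriting; the attribute names follow the quoted code:

```python
# Bake matching values into the converted checkpoint configs up front,
# then only check consistency at model init instead of mutating the config.
if config.vision_config.decoder_hidden_size != config.text_config.hidden_size:
    raise ValueError(
        "`vision_config.decoder_hidden_size` must match `text_config.hidden_size`, "
        f"got {config.vision_config.decoder_hidden_size} and {config.text_config.hidden_size}."
    )
```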
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
Since we now follow the full encoder-decoder structure, it would be nicer to stay closer to them, e.g.
transformers/src/transformers/models/bart/modeling_bart.py
Lines 759 to 771 in 727741f
We can still keep get_image_features; it just acts more as a nice utility then, not as a core part of forward. Roughly like the sketch below.
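A sketch of the forward body mirroring Bart's handling of precomputed encoder outputs; the `self.encoder` attribute name is an assumption:

```python
from transformers.modeling_outputs import BaseModelOutput

# Inside forward(): run the vision encoder only when outputs weren't precomputed,
# and wrap bare tuples the way Bart does for backward compatibility.
if encoder_outputs is None:
    encoder_outputs = self.encoder(pixel_values=pixel_values, **kwargs)
elif isinstance(encoder_outputs, tuple):
    encoder_outputs = BaseModelOutput(
        last_hidden_state=encoder_outputs[0],
        hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
        attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
    )
```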
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
|
| image_features = encoder_outputs.pooler_output.to(self.decoder.device, self.decoder.dtype)
Rebump, maybe you missed committing it :D
run-slow: pp_formulanet

This comment contains models: ["models/pp_formulanet"]

No description provided.