[Model] Add PP-FormulaNet Model Support #45626
zhang-prog wants to merge 12 commits into huggingface:main
Conversation
vasqu
left a comment
Heya, first round 🤗 You weren't lying when you said it was more complicated :D I've made fewer comments to focus on the core first: restructure to a VLM and use the existing patterns with our normal generate pipeline.
| from PIL import Image
| from transformers import AutoProcessor, AutoModelForTextRecognition
|
| model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"

Suggested change:
- model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"
+ model_path = "PaddlePaddle/PP-FormulaNet_plus-L_safetensors"  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"
Not sure, but two checkpoints have been mentioned in the docs.
| s = news
| news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, noletter), r"\1\2", s)
| news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, letter), r"\1\2", news)
| news = re.sub(r"(%s)\s+?(%s)" % (letter, noletter), r"\1\2", news)
We should compile the regexes outside the loops; probably similar above.
| """
| text = self.remove_chinese_text_wrapping(text)
| try:
|     from ftfy import fix_text
Not a fan of an extra dependency tbh, but I guess it's too complicated/long to adopt here.
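If the dependency stays, a guarded import keeps it optional; a minimal sketch, with `_fix_text` being an illustrative helper name rather than the PR's actual code:

```python
# A minimal sketch of keeping ftfy a soft dependency; `_fix_text` is an
# assumed helper name, not necessarily what the PR uses.
def _fix_text(text: str) -> str:
    try:
        from ftfy import fix_text  # optional dependency, only used if installed
    except ImportError:
        return text  # graceful fallback when ftfy is not available
    return fix_text(text)
```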
| import torch
|
| class PPFormulaNetModelTester:
We can then use our VLM tester instead
transformers/tests/vlm_tester.py
Line 36 in 622b8e9
At this point, because we have a special encoder-decoder that does not fit the standard VLM style, I'm not sure it really fits; maybe a classic encoder-decoder approach might be better.
@vasqu I've restructured PPFormulaNet into a VLM. Some unit tests are still failing and I'm fixing them, but that shouldn't block you from reviewing the latest model structure code. PTAL.
vasqu
left a comment
Much better. I focused further on the model structure; I think the core is good now, and what's left is the details and how to make it fit within our style.
| @auto_docstring(checkpoint="PaddlePaddle/PPFormulaNet_plus-L_safetensors")
| @strict
| class PPFormulaNetTextConfig(PreTrainedConfig):

Suggested change:
- class PPFormulaNetTextConfig(PreTrainedConfig):
+ class PPFormulaNetTextConfig(MBartConfig):
We should inherit from MBart directly; that way we don't have to think too much about what is actually needed.
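With modular transformers, that could be as small as this sketch (assuming only the model_type and checkpoint defaults differ from MBart):

```python
# A sketch of the modular approach; inheriting from MBartConfig pulls in all of
# its attributes, so only genuine differences need to be spelled out here.
from transformers.models.mbart.configuration_mbart import MBartConfig


class PPFormulaNetTextConfig(MBartConfig):
    model_type = "pp_formulanet_text"  # assumed value, adjust to the actual registration
```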
| max_length (`int`, *optional*, defaults to 1537):
|     Controls the maximum length to use by one of the truncation/padding parameters.
You might be looking for max_position_embeddings instead; at the very least, this should not be part of the model but of the tokenizer. Probably left over from the old model pattern where you manually called generate.
| @auto_docstring(
|     checkpoint="PaddlePaddle/PPFormulaNet_plus-L_safetensors"
| )  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"

Suggested change:
- )  # or "PaddlePaddle/PP-FormulaNet-L_safetensors"
+ )
tbh, I would mention it in the model docs (model_doc/pp_formulanet.md) but not here, because the default values are valid for that checkpoint; we only need one example here.
| decoder_outputs = self.language_model.decoder(
|     input_ids=decoder_input_ids,
|     attention_mask=decoder_attention_mask,
|     encoder_hidden_states=image_features,
|     encoder_attention_mask=attention_mask,
|     past_key_values=past_key_values,
|     inputs_embeds=decoder_inputs_embeds,
|     use_cache=use_cache,
|     **kwargs,
| )

Suggested change (only the call target changes):
- decoder_outputs = self.language_model.decoder(
+ decoder_outputs = self.language_model(
Like mentioned before, I would like to move away from the ForCausalLM model and use the decoder directly.
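A sketch of what I mean, using MBartDecoder as a stand-in for the text backbone; the class and attribute names here are assumptions, not the PR's actual code:

```python
import torch.nn as nn
from transformers.models.mbart.modeling_mbart import MBartDecoder


class SketchedConditionalGeneration(nn.Module):
    def __init__(self, config):
        super().__init__()
        # hold the bare decoder so forward() can call self.language_model(...) directly
        self.language_model = MBartDecoder(config.text_config)
        # the LM head lives on the wrapper instead of inside a ForCausalLM submodel
        self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)
```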
| def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
|     return GenerationMixin._prepare_encoder_decoder_kwargs_for_generation(*args, **kwargs)

Suggested change:
- def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
-     return GenerationMixin._prepare_encoder_decoder_kwargs_for_generation(*args, **kwargs)
+ def _prepare_encoder_decoder_kwargs_for_generation(self, *args, **kwargs):
+     raise AttributeError()
I think you just don't want to inherit here? Raising an AttributeError tells modular not to.
| def get_encoder(self):
|     return self.model.vision_tower

Suggested change:
- def get_encoder(self):
-     return self.model.vision_tower
| encoder_last_hidden_state=encoder_outputs.last_hidden_state,
| encoder_hidden_states=encoder_outputs.hidden_states,
| encoder_attentions=encoder_outputs.attentions,
| image_hidden_states=image_features if pixel_values is not None else None,
Wouldn't this fit image_last_hidden_state better? You want the last (pooled) feature, not the set of hidden states across all of this.
Imo, we can even leave this out completely, as the encoder is everything image-related. The output class should be new and explain that the encoder == vision encoder, hence the different expected shapes and all; see the sketch below.
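Something like this, as a sketch; the class name and field set are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput


@dataclass
class PPFormulaNetModelOutput(ModelOutput):
    """Sketched output class where the encoder is the *vision* encoder, so the
    encoder fields hold image-patch features rather than text states."""

    last_hidden_state: Optional[torch.FloatTensor] = None
    past_key_values: Optional[tuple] = None
    encoder_last_hidden_state: Optional[torch.FloatTensor] = None  # (batch, num_patches, hidden)
    encoder_hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    encoder_attentions: Optional[tuple[torch.FloatTensor, ...]] = None
```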
| input_ids: torch.LongTensor | None = None,
| pixel_values: torch.FloatTensor | None = None,
| attention_mask: torch.Tensor | None = None,
| decoder_input_ids: torch.LongTensor | None = None,
| decoder_attention_mask: torch.LongTensor | None = None,
| decoder_inputs_embeds: torch.FloatTensor | None = None,
| encoder_outputs: list[torch.FloatTensor] | None = None,
| past_key_values: Cache | None = None,
| inputs_embeds: torch.FloatTensor | None = None,
| use_cache: bool | None = None,
| **kwargs,

Suggested change (unchanged parameters omitted):
- input_ids: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
+ attention_mask: torch.Tensor | None = None,  # TODO check if this is really used, likely to be removed as well
- inputs_embeds: torch.FloatTensor | None = None,
Noticing that we don't need those; we have pure images and no associated text in the encoder, so we can remove them.
The main input name should be pixel_values; not sure if that is already the case within the pretrained model :D
The parameters still need to remain in the argument list; otherwise, calling self.language_model(..., **kwargs) raises errors like:

    got multiple values for keyword argument 'attention_mask'
    got multiple values for keyword argument 'input_ids'
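For reference, a minimal repro of that clash with illustrative names:

```python
# A keyword passed both explicitly and inside **kwargs triggers the error above.
def language_model(input_ids=None, attention_mask=None, **kwargs):
    pass


kwargs = {"attention_mask": "mask forwarded from the caller"}
language_model(attention_mask=None, **kwargs)
# TypeError: language_model() got multiple values for keyword argument 'attention_mask'
```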
| - `'tf'`: Return TensorFlow `tf.constant` objects.
| - `'pt'`: Return PyTorch `torch.Tensor` objects.
| - `'np'`: Return NumPy `np.ndarray` objects.
| - `'jax'`: Return JAX `jnp.ndarray` objects.
| def get_encoder(self):
|     return self.vision_tower
Same deletion here; get_encoder accepts a modality arg and is defined in the parent.
vasqu
left a comment
Thanks a lot, already looking good! Left a few comments on some less critical parts, but it would still be nice to fix/change them 🤗
| import httpx
| from PIL import Image
| from transformers import AutoProcessor, PPFormulaNetForConditionalGeneration
| image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
| return BatchFeature({**image_inputs})
|
| def normalize(self, s: str) -> str:
nit: let's avoid the single-letter name s and just use text or similar
| rule_noletter_noletter = re.compile(r"(?!\\ )(%s)\s+?(%s)" % (noletter, noletter))
| rule_noletter_letter = re.compile(r"(?!\\ )(%s)\s+?(%s)" % (noletter, letter))
| rule_letter_noletter = re.compile(r"(%s)\s+?(%s)" % (letter, noletter))
On second thought, would it make sense to be more extreme and compile those regexes just once at init time? Same below. Roughly like the sketch that follows.
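A sketch, assuming the character classes from the snippet above; only a single pass is shown where the real code loops until the string stabilizes:

```python
import re


class LatexNormalizer:
    def __init__(self):
        letter = "[a-zA-Z]"      # assumed character classes, following the
        noletter = r"[\W_^\d]"   # %-substituted patterns quoted above
        # compile once at construction instead of on every normalize() call
        self.rule_noletter_noletter = re.compile(rf"(?!\\ )({noletter})\s+?({noletter})")
        self.rule_noletter_letter = re.compile(rf"(?!\\ )({noletter})\s+?({letter})")
        self.rule_letter_noletter = re.compile(rf"({letter})\s+?({noletter})")

    def normalize(self, text: str) -> str:
        # single pass shown; the actual code re-applies the rules to a fixpoint
        text = self.rule_noletter_noletter.sub(r"\1\2", text)
        text = self.rule_noletter_letter.sub(r"\1\2", text)
        return self.rule_letter_noletter.sub(r"\1\2", text)
```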
| input_ids: torch.LongTensor | None = None,
| attention_mask: torch.Tensor | None = None,
Can we mention with a small comment that we only keep these in the signature for generate compatibility?
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
| elif encoder_outputs.pooler_output is None:
|     encoder_outputs.pooler_output = self.multi_modal_projector(encoder_outputs.last_hidden_state)
Imo, we shouldn't need this. Maybe we should either
- move the projector into the encoder as well, or
- adjust the generation pipeline where we prepare the encoder outputs so that it calls get_image_features instead.

A sketch of the first option follows below.
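A sketch of option 1; the module names mirror the snippet above, but the wrapper class name is assumed:

```python
import torch.nn as nn


class PPFormulaNetVisionEncoder(nn.Module):
    """Sketch: fold the projector into the encoder so that encoder_outputs always
    carries projected features and the elif branch above disappears."""

    def __init__(self, vision_tower, multi_modal_projector):
        super().__init__()
        self.vision_tower = vision_tower
        self.multi_modal_projector = multi_modal_projector

    def forward(self, pixel_values, **kwargs):
        outputs = self.vision_tower(pixel_values, **kwargs)
        # always attach the projected features before returning
        outputs.pooler_output = self.multi_modal_projector(outputs.last_hidden_state)
        return outputs
```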
| # test_torch_exportable = False
| # model_split_percents = [0.5, 0.9]

Suggested change:
- # test_torch_exportable = False
- # model_split_percents = [0.5, 0.9]
| @unittest.skip(reason="PPFormulaNet is not small")
| def test_model_is_small(self):
|     pass
| @pytest.mark.generate
| @unittest.skip(reason="PPFormulaNet does not support beam search.")
| def test_beam_sample_generate(self):
|     pass
Would be nice to fix but also not that big of a deal
Done, the beam search tests all pass now.
| @unittest.skip(
|     reason="GenerationMixin._expand_inputs_for_generation() got multiple values for keyword argument 'input_ids'"
| )
| def test_generate_continue_from_past_key_values(self):
Hmm, this should be fixed imo if possible, maybe by overriding the test or something else.
Looks like the rtol/atol is maybe too low, but yeah, no worries, we can keep it skipped; not a high prio imo.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, pp_formulanet
vasqu
left a comment
Carefully approving because it's only small stuff now 🤗 I will check in with run-slow in a sec as well, just as a sanity check.
| def __init__(self, config):
|     super().__init__(config)
|
|     config.vision_config.decoder_hidden_size = config.text_config.hidden_size
This shouldn't be necessary; I'd rather adjust the values in the config from the get-go, e.g. validate instead of overwrite, as in the sketch below.
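E.g. a sketch that validates instead of overwriting; the attribute names follow the quoted code:

```python
# Bake matching values into the converted checkpoint configs up front,
# then only check consistency at model init instead of mutating the config.
if config.vision_config.decoder_hidden_size != config.text_config.hidden_size:
    raise ValueError(
        "`vision_config.decoder_hidden_size` must match `text_config.hidden_size`, "
        f"got {config.vision_config.decoder_hidden_size} and {config.text_config.hidden_size}."
    )
```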
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
Since we now follow the full encoder-decoder structure, it would be nicer to stay closer to them, e.g.
transformers/src/transformers/models/bart/modeling_bart.py
Lines 759 to 771 in 727741f
We can still keep get_image_features; it just acts more as a nice utility then, not as a core part of forward. Roughly like the sketch below.
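A sketch of the forward body mirroring Bart's handling of precomputed encoder outputs; the `self.encoder` attribute name is an assumption:

```python
from transformers.modeling_outputs import BaseModelOutput

# Inside forward(): run the vision encoder only when outputs weren't precomputed,
# and wrap bare tuples the way Bart does for backward compatibility.
if encoder_outputs is None:
    encoder_outputs = self.encoder(pixel_values=pixel_values, **kwargs)
elif isinstance(encoder_outputs, tuple):
    encoder_outputs = BaseModelOutput(
        last_hidden_state=encoder_outputs[0],
        hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
        attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
    )
```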
| if encoder_outputs is None:
|     encoder_outputs = self.get_image_features(pixel_values, **kwargs)
|
| image_features = encoder_outputs.pooler_output.to(self.decoder.device, self.decoder.dtype)
Rebump, maybe you missed committing it :D
run-slow: pp_formulanet

This comment contains models: ["models/pp_formulanet"]

No description provided.