docs: Update LayoutLMv3 model card with standardized format and impro… #37155
carrycooldude wants to merge 10 commits into huggingface:main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the button.
stevhliu
left a comment
Thanks, this is a good start! Please refer to the Gemma 3 docs to see how to standardize this doc 🤗
> [](https://pytorch.org/get-started/locally/)

Please style these with `<div>` tags. You can copy it from one of the existing updated model cards on main, like Gemma 3.
> # LayoutLMv3
>
> ## Overview
> LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.

Suggested change:

> [LayoutLMv3](https://huggingface.co/papers/2204.08387) is a multimodal transformer model designed specifically for Document AI tasks. It unifies the pretraining objectives for text and images, masked language and masked image modeling, and also includes a word-patch alignment objective for even stronger text and image alignment. The model architecture is also unified and uses a more streamlined approach with patch embeddings (similar to [ViT](./vit)) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
> <Tip>
> Click on the right sidebar for more examples of how to use the model for different tasks!
> </Tip>

Suggested change:

> > [!TIP]
> > Click on the LayoutLMv3 models in the right sidebar for more examples of how to apply LayoutLMv3 to different vision and language tasks.
> outputs = model(**encoding)
> ```
>
> ## Using transformers-cli
We can remove this since transformers-cli doesn't support image inputs
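A plain Python forward pass could take the place of the removed CLI section. A minimal sketch, not the PR's final text: it assumes `pytesseract` is installed for the processor's built-in OCR, and reuses the checkpoint from the quoted example.

```python
# Hypothetical replacement for the removed transformers-cli section:
# run a bare forward pass using the processor's built-in OCR.
from PIL import Image
from transformers import AutoModel, AutoProcessor

def run_layoutlmv3(image_path, model_id="microsoft/layoutlmv3-base"):
    """Encode a document image and return the model outputs."""
    processor = AutoProcessor.from_pretrained(model_id, apply_ocr=True)
    model = AutoModel.from_pretrained(model_id)
    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, return_tensors="pt")
    return model(**encoding)

if __name__ == "__main__":
    outputs = run_layoutlmv3("form.jpg")
    print(outputs.last_hidden_state.shape)
```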
> ## Quantization
>
> For large models, you can use quantization to reduce memory usage:
Update the code example below accordingly
Suggested change:

> Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
> The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
> - [Document question answering task guide](../tasks/document_question_answering)
>
> ## LayoutLMv3Config
> ## Model Details
The rest of these changes should be reverted
Not resolved yet, the ## Model Details is still there as are the changes to the header levels of the LayoutLMv3 classes
Force-pushed 819c757 to 5b92ea6
Force-pushed 9368ed6 to b0aeeec
Force-pushed b15eb3d to 294e6e9
@stevhliu, please have a look at this.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey, it is still a bit off in a few places! I suggest taking a look at the Gemma 3 model card again and trying to align your model card with it as much as possible!
Force-pushed 7836f29 to 61f22d5
stevhliu
left a comment
There are a lot of unresolved changes, so please don't mark them as resolved 😅
> # LayoutLMv3
>
> ## Overview
> LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
> This unified architecture and training approach makes LayoutLMv3 particularly effective for both text-centric tasks (like form understanding and receipt analysis) and image-centric tasks (like document classification and layout analysis).
>
> *Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
>
> [Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)

Suggested change:

> You can find all the original LayoutLMv3 checkpoints under the [LayoutLM](https://huggingface.co/collections/microsoft/layoutlm-6564539601de72cb631d0902) collection.
> <Tip>
> Click on the right sidebar for more examples of how to use the model for different tasks!
> </Tip>
> - [Document question answering task guide](../tasks/document_question_answering)
>
> ## LayoutLMv3Config
> ## Model Details
Not resolved yet, the ## Model Details is still there as are the changes to the header levels of the LayoutLMv3 classes
> ## Quantization
>
> For large models, you can use quantization to reduce memory usage:
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
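If the card settles on a single precision as requested, an 8-bit version is one option. This is a sketch under the assumption that bitsandbytes is the chosen backend (the PR could equally use torchao int4 instead):

```python
# Sketch: 8-bit weight quantization with bitsandbytes, showing a single
# precision rather than both 8- and 4-bit examples.
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig

def load_8bit_layoutlmv3(model_id="microsoft/layoutlmv3-base"):
    """Load LayoutLMv3 with weights quantized to 8-bit."""
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForTokenClassification.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
    )

if __name__ == "__main__":
    model = load_8bit_layoutlmv3()
```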
> outputs = model(**encoding)
> ```
>
> ## Using transformers-cli
> result = token_classifier("form.jpg")
>
> # For question answering
> qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")
> ## Using the Pipeline
>
> The easiest way to use LayoutLMv3 is through the pipeline API:
Unresolved as there are still other examples here besides question answering
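Trimmed to question answering only, the section could reduce to something like this sketch. Assumptions: `pytesseract` is available for OCR, and the base checkpoint from the quoted diff is not fine-tuned for QA, so a QA-fine-tuned checkpoint would be needed for meaningful answers.

```python
from transformers import pipeline

def build_doc_qa(model_id="microsoft/layoutlmv3-base"):
    """Build a document question answering pipeline (OCR via pytesseract)."""
    return pipeline("document-question-answering", model=model_id)

if __name__ == "__main__":
    qa = build_doc_qa()
    print(qa(image="form.jpg", question="What is the total amount?"))
```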
> ## Quick Start
>
> Here's a quick example of how to use LayoutLMv3 for document understanding:
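One shape the quick-start example might take is token classification with pre-extracted words and bounding boxes, which avoids the OCR dependency. This is a sketch, not the PR's text; `num_labels=7` and the sample words/boxes are illustrative assumptions (boxes are normalized to the 0-1000 range LayoutLMv3 expects).

```python
# Sketch of a quick-start example: token classification with
# pre-extracted words and normalized (0-1000) bounding boxes.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

def classify_tokens(image_path, words, boxes,
                    model_id="microsoft/layoutlmv3-base"):
    """Return per-token label predictions for a document image."""
    processor = AutoProcessor.from_pretrained(model_id, apply_ocr=False)
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        model_id, num_labels=7  # illustrative label count (e.g. FUNSD-style)
    )
    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")
    outputs = model(**encoding)
    return outputs.logits.argmax(-1)

if __name__ == "__main__":
    preds = classify_tokens(
        "form.jpg",
        words=["Invoice", "Total:", "$1,000"],
        boxes=[[70, 50, 160, 70], [70, 600, 140, 620], [150, 600, 230, 620]],
    )
    print(preds)
```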
We'll need to update the badges to include FlashAttention and the code examples to include SDPA once #35469 is merged!

Sure, will see to that too.
Update LayoutLMv3 Model Card Documentation
This PR updates the LayoutLMv3 model card documentation to follow the standardized format as requested in #36979. The changes improve the documentation's clarity and usability while maintaining consistency with other model cards in the repository.
What does this PR do?
This PR enhances the LayoutLMv3 model card documentation by:
The changes make the documentation more accessible and provide ready-to-use examples for different use cases, following the standardized format used in other model cards like Gemma 3, PaliGemma, and ViT.
#36979
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Since this is a documentation update for a vision-language model, I would suggest tagging: