README: Update Bert Japanese model card#39466
KeshavSingh29 wants to merge 1 commit into huggingface:main
Conversation
| <img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,…"> | ||
| </div> | ||
| BertJapanese is a bidirectional transformer model that keeps the same architecture as original BERT and keeps the same learning objective, i.e., to predict masked tokens in a sentence and to predict whether one sentence follows another. While the architecture is same, BertJapanese relies on specific tokenization methods (wordpiece / character) that are more suitable for Japanese text. |
| BertJapanese is a bidirectional transformer model that keeps the same architecture as original BERT and keeps the same learning objective, i.e., to predict masked tokens in a sentence and to predict whether one sentence follows another. While the architecture is same, BertJapanese relies on specific tokenization methods (wordpiece / character) that are more suitable for Japanese text. | |
| BertJapanese is a [BERT](./bert) model pretrained on Japanese text. It uses the MeCab and WordPiece tokenizers or character tokenization. | |
| You can find all the original BERTJapanese checkpoints under the [tohoku-nlp](https://huggingface.co/tohoku-nlp) organization. | |
| > [!TIP] | |
| > This model was contributed by [tohoku-nlp](https://huggingface.co/tohoku-nlp). | |
| > | |
| > Refer to the [BERT](./bert) docs for usage examples. |
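For intuition on the WordPiece side of the suggested wording, here is a minimal stdlib-only sketch of greedy longest-match-first subword splitting over a word that MeCab has already segmented out. The function name, toy vocabulary, and example words are made up for illustration; the real tokenizer is `BertJapaneseTokenizer`.

```python
# Sketch (not the transformers implementation) of greedy longest-match
# WordPiece over a single MeCab-segmented word.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches here: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; real checkpoints ship a full subword vocab.
vocab = {"天気", "天", "##気", "予", "##報"}
print(wordpiece("天気", vocab))  # ['天気']
print(wordpiece("予報", vocab))  # ['予', '##報']
```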
Thanks for the comments / edits @stevhliu
div tag on line 24 is still needed.
Apart from that I have made the changes.
| Check Notes for additional details. |
| The example below demonstrates how to predict the [MASK] token with [`Pipeline`] or the [`AutoModel`] class using model with MeCab and WordPiece tokenization. |
| The example below demonstrates how to predict the [MASK] token with MeCab and WordPiece tokenization using [`Pipeline`], [`AutoModel`], and from the command line. |
| > [!TIP] | ||
| > Note that this is the base model; you need to add a task-specific head and fine-tune it further to make sure you get accurate results.
Example of using a model with MeCab and WordPiece tokenization:

```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(
    "tohoku-nlp/bert-base-japanese", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese")

text = "今日は[MASK]天気ですね。"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits
masked_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"The predicted token is: {predicted_token}")
```

```python
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
```
| <hfoption id="transformers-cli"></hfoption> |
| </Tip> | ||
| ## BertConfig |
Don't need to change any of the API references here
@stevhliu In the original doc for bert_japanese
only the tokenizer ref. is mentioned. Would you like to keep the status quo here?
For ref:
## BertJapaneseTokenizer
[[autodoc]] BertJapaneseTokenizer
Yeah, let's keep it how it is originally.
| >>> ## Input Japanese Text | ||
| >>> line = "吾輩は猫である。" | ||
| ## Notes | ||
| - The model architecture(same as original BERT by Google) comes in two variants: |
Remove all these notes and replace with an example of character tokenization.
| - The model architecture(same as original BERT by Google) comes in two variants: | |
| - The example below demonstrates character tokenization. | |
| ```py | |
| add code example here |
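The requested example would load `tohoku-nlp/bert-base-japanese-char`, which also needs the MeCab extras from `pip install transformers["ja"]`. As a dependency-free illustration of what that mode does, character subword tokenization reduces to splitting each word into single characters; the toy vocabulary below is hypothetical.

```python
# Sketch of character-level subword tokenization as used by the *-char
# checkpoints: each word becomes a sequence of single characters, and
# characters missing from the vocabulary map to [UNK].
def char_tokenize(word, vocab, unk="[UNK]"):
    return [ch if ch in vocab else unk for ch in word]

vocab = {"吾", "輩", "は", "猫", "で", "あ", "る", "。"}
print(char_tokenize("吾輩は猫である。", vocab))  # ['吾', '輩', 'は', '猫', 'で', 'あ', 'る', '。']
```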
| ## Overview | ||
| <hfoptions id="usage"> | ||
| <hfoption id="Pipeline"> |
```bash
pip install transformers["ja"]
```

```py
import torch
from transformers import pipeline

pipeline = pipeline("fill-mask", model="tohoku-nlp/bert-base-japanese", torch_dtype=torch.float16, device=0)
pipeline("今日は[MASK]天気ですね。")
```
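Under the hood, fill-mask takes the logits at the [MASK] position, applies a softmax over the vocabulary, and returns the top-scoring tokens. A stdlib-only sketch of that scoring step follows; the candidate tokens and logits are made up for illustration.

```python
import math

# Toy sketch of fill-mask scoring: softmax over the vocabulary logits at
# the [MASK] position, then return the top-k (token, probability) pairs.
def top_k_fill(logits, id_to_token, k=2):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id_to_token[i], round(probs[i], 3)) for i in ranked[:k]]

id_to_token = {0: "いい", 1: "悪い", 2: "変な"}  # hypothetical fillers for [MASK]
logits = [2.0, 0.5, -1.0]                        # hypothetical model logits
print(top_k_fill(logits, id_to_token))  # [('いい', 0.786), ('悪い', 0.175)]
```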
What does this PR do?
As mentioned in #36979, this contributes to the HF model cards effort, specifically the BERT Japanese card.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@stevhliu