
Conversation

@tushuhei (Member)

Fixes #387, #220, #216, #157

This new Japanese model addresses several quality issues by incorporating a "weighted samples" approach that emphasizes the fine-tuning data during training. It leverages recent updates to the training script (including those in #358 and #408), and was generated using the following commands:

curl -o knbc.tar.bz2 https://nlp.ist.i.kyoto-u.ac.jp/kuntt/KNBC_v1.0_090925_utf8.tar.bz2
tar -xf knbc.tar.bz2  # this generates the KNBC_v1.0_090925_utf8 directory.
python budoux/scripts/prepare_knbc.py KNBC_v1.0_090925_utf8 -o source_knbc.txt
shuf --random-source=source_knbc.txt source_knbc.txt | split -l $[ $(wc -l source_knbc.txt | cut -d" " -f1) * 90 / 100 ]  # 90% of the lines go to xaa (training), the rest to xab (validation).
python budoux/scripts/encode_data.py budoux/data/finetuning/ja/train.txt -o train_finetune.txt --scale=100  # --scale=100 up-weights the fine-tuning samples.
python budoux/scripts/encode_data.py xaa -o train_knbc.txt
cat train_knbc.txt train_finetune.txt > train.txt
python budoux/scripts/encode_data.py xab -o val.txt
python budoux/scripts/train.py train.txt --iter=150000 --val-data=val.txt --output=weights.txt --scale=1
python budoux/scripts/build_model.py weights.txt -o model.json
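For a quick sanity check, the resulting model.json can be loaded through the BudouX Python API. The sketch below is illustrative only; it assumes the budoux package is installed and that budoux.Parser accepts the model as a plain dict loaded from JSON, as in the current Python implementation:

import json
import budoux

# Load the newly trained model instead of the bundled default one.
with open('model.json', encoding='utf-8') as f:
    model = json.load(f)

parser = budoux.Parser(model)

# Split a sample sentence into semantic chunks.
print(parser.parse('今日は天気です。'))  # e.g. ['今日は', '天気です。']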

tushuhei requested a review from kojiishi on December 13, 2023, 06:51
@kojiishi (Collaborator) left a comment


lgtm


Labels

quality: Model quality improvements


Development

Successfully merging this pull request may close these issues.

[quality] "のみ"
