-
-
Notifications
You must be signed in to change notification settings - Fork 3
Use 'src' and 'trg' mirroring silnlp when src and trg lang codes are equal #199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):
else: if src_lang == tgt_lang: train_dataset = self._corpus.filter_nonempty().to_hf_dataset("src", "trg")
Could we add a _src and _trg suffix to the language code if they are the same?
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):
tokenizer.backend_tokenizer.normalizer = norm_tok.backend_tokenizer.normalizer # type: ignore if self._add_unk_src_tokens and self._add_unk_tgt_tokens: lang_codes = [src_lang, tgt_lang]
We should use self._src_lang and self._tgt_lang here. The local variables should only be used when retrieving data from the dataset.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):
def preprocess_function(examples): if isinstance(tokenizer, (NllbTokenizer, NllbTokenizerFast)): inputs = [self._mpn.normalize(prefix + ex[src_lang]) for ex in examples["translation"]]
src_lang and tgt_lang should be set to the new values that are used above, so that the correct codes are used here.
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Could we add a
_srcand_trgsuffix to the language code if they are the same?
Sure! I was just following what was in silnlp.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We should use
self._src_langandself._tgt_langhere. The local variables should only be used when retrieving data from the dataset.
Yes, that simplifies it.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
src_langandtgt_langshould be set to the new values that are used above, so that the correct codes are used here.
Oh, yes, I'm silly. I had that before, and then switched it because it was changing the codes when finding missing characters. Thank you!
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #199 +/- ##
==========================================
- Coverage 88.92% 88.91% -0.01%
==========================================
Files 282 282
Lines 17053 17056 +3
==========================================
+ Hits 15165 15166 +1
- Misses 1888 1890 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Yes, that simplifies it.
I think this can be simplified further. Something like this:
if self._add_unk_src_tokens and self._src_lang is not None:
lang_codes.append(self._src_lang)
if self._add_unk_tgt_tokens and self._tgt_lang is not None:
lang_codes.append(self._tgt_lang)
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I think this can be simplified further. Something like this:
if self._add_unk_src_tokens and self._src_lang is not None: lang_codes.append(self._src_lang) if self._add_unk_tgt_tokens and self._tgt_lang is not None: lang_codes.append(self._tgt_lang)
Done.
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
5b28d16 to
84ba0ac
Compare
Can't we just always pass "src"/"trg" here?
Fixes sillsdev/serval#707
The issue was here:
machine.py/machine/corpora/parallel_text_corpus.py
Line 429 in cd21706
When
source_langandtarget_langare equal, the dictionary definition collapses to just one of the items.I have yet to test the complete pipeline. But I plan to publish a development docker image and then test E2E through local Serval.
This change is