Use 'src' and 'trg' mirroring silnlp when src and trg lang codes are equal #199

Enkidu93 · 2025-06-23T20:37:16Z

Can't we just always pass "src"/"trg" here?

The issue was here:

machine.py/machine/corpora/parallel_text_corpus.py

Line 429 in cd21706

    
           example[translation_column] = {source_lang: row.source_text, target_lang: row.target_text}

When source_lang and target_lang are equal, the dictionary definition collapses to just one of the items.
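
A minimal sketch of the collapse, using made-up strings in place of `row.source_text` and `row.target_text` (the values and the `fixed` name are illustrative, not from the code base):

```python
# Hypothetical values standing in for row.source_text / row.target_text.
source_lang = target_lang = "en"
source_text, target_text = "hello", "hi there"

# Duplicate keys in a dict display are legal in Python; the last value wins,
# so the source text is silently dropped.
example = {source_lang: source_text, target_lang: target_text}

# Distinct keys such as "src"/"trg" keep both sides of the pair.
fixed = {"src": source_text, "trg": target_text}
```

Here `example` ends up with a single `"en"` entry holding the target text, while `fixed` keeps both strings.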

I have not yet tested the complete pipeline, but I plan to publish a development Docker image and then test end to end through a local Serval instance.

ddaspit

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):

        else:
            if src_lang == tgt_lang:
                train_dataset = self._corpus.filter_nonempty().to_hf_dataset("src", "trg")

Could we add a _src and _trg suffix to the language code if they are the same?
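
One way the suffix idea could look, as a sketch only: `dataset_keys` is a hypothetical helper name, not something in machine.py, and the exact suffix format is an assumption based on the comment above.

```python
from typing import Tuple


def dataset_keys(src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    """Return dictionary keys that stay distinct even when the codes match."""
    if src_lang == tgt_lang:
        # Same code on both sides: disambiguate with a suffix.
        return f"{src_lang}_src", f"{tgt_lang}_trg"
    # Different codes are already distinct keys.
    return src_lang, tgt_lang
```

The resulting pair could then be passed to `to_hf_dataset` in place of the raw language codes.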

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

                tokenizer.backend_tokenizer.normalizer = norm_tok.backend_tokenizer.normalizer  # type: ignore
                if self._add_unk_src_tokens and self._add_unk_tgt_tokens:
                    lang_codes = [src_lang, tgt_lang]

We should use self._src_lang and self._tgt_lang here. The local variables should only be used when retrieving data from the dataset.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):

        def preprocess_function(examples):
            if isinstance(tokenizer, (NllbTokenizer, NllbTokenizerFast)):
                inputs = [self._mpn.normalize(prefix + ex[src_lang]) for ex in examples["translation"]]

src_lang and tgt_lang should be set to the new values that are used above, so that the correct codes are used here.
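
To illustrate why the keys have to match, here is a small sketch with a hand-built batch; `read_pairs` and the `en_src`/`en_trg` keys are hypothetical and only mimic the shape of `examples["translation"]`.

```python
from typing import Dict, List, Tuple


def read_pairs(
    examples: Dict[str, List[Dict[str, str]]], src_key: str, tgt_key: str
) -> Tuple[List[str], List[str]]:
    """Pull source and target strings out of a batch using the dataset keys."""
    inputs = [ex[src_key] for ex in examples["translation"]]
    targets = [ex[tgt_key] for ex in examples["translation"]]
    return inputs, targets


# A batch built with disambiguated keys (assumed format).
batch = {"translation": [{"en_src": "hello", "en_trg": "hi there"}]}
inputs, targets = read_pairs(batch, "en_src", "en_trg")
```

Looking the same batch up with the bare code `"en"` would raise a `KeyError`, which is why the lookup keys must be the ones used when the dataset was built.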

Enkidu93

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Could we add a _src and _trg suffix to the language code if they are the same?

Sure! I was just following what was in silnlp.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should use self._src_lang and self._tgt_lang here. The local variables should only be used when retrieving data from the dataset.

Yes, that simplifies it.

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

src_lang and tgt_lang should be set to the new values that are used above, so that the correct codes are used here.

Oh, yes, I'm silly. I had that before, and then switched it because it was changing the codes when finding missing characters. Thank you!

codecov-commenter · 2025-06-23T21:16:01Z

Codecov Report

Attention: Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.91%. Comparing base (b07fcb6) to head (84ba0ac).

Files with missing lines	Patch %	Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py	75.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #199      +/-   ##
==========================================
- Coverage   88.92%   88.91%   -0.01%     
==========================================
  Files         282      282              
  Lines       17053    17056       +3     
==========================================
+ Hits        15165    15166       +1     
- Misses       1888     1890       +2

ddaspit

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Yes, that simplifies it.

I think this can be simplified further. Something like this:

if self._add_unk_src_tokens and self._src_lang is not None:
    lang_codes.append(self._src_lang)
if self._add_unk_tgt_tokens and self._tgt_lang is not None:
    lang_codes.append(self._tgt_lang)
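
For reference, a self-contained version of the same idea. The function name and parameters are hypothetical stand-ins for the trainer's private attributes, and the empty `lang_codes` list is assumed to be initialized just before the two checks.

```python
from typing import List, Optional


def collect_lang_codes(
    add_unk_src_tokens: bool,
    add_unk_tgt_tokens: bool,
    src_lang: Optional[str],
    tgt_lang: Optional[str],
) -> List[str]:
    """Collect the language codes whose missing characters should be added."""
    lang_codes: List[str] = []
    if add_unk_src_tokens and src_lang is not None:
        lang_codes.append(src_lang)
    if add_unk_tgt_tokens and tgt_lang is not None:
        lang_codes.append(tgt_lang)
    return lang_codes
```

Each side is handled independently, so enabling only one flag or leaving one code unset needs no extra branches.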

Enkidu93

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this can be simplified further. Something like this:

if self._add_unk_src_tokens and self._src_lang is not None:
    lang_codes.append(self._src_lang)
if self._add_unk_tgt_tokens and self._tgt_lang is not None:
    lang_codes.append(self._tgt_lang)

Done.

ddaspit

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

ddaspit

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

Enkidu93 requested review from ddaspit and mshannon-sil June 23, 2025 20:37

ddaspit requested changes Jun 23, 2025

Enkidu93 commented Jun 23, 2025

ddaspit reviewed Jun 23, 2025

Enkidu93 commented Jun 23, 2025

ddaspit approved these changes Jun 23, 2025

ddaspit approved these changes Jun 24, 2025

Enkidu93 added 4 commits June 25, 2025 12:05

Use 'src' and 'trg' mirroring silnlp when src and trg lang codes are the same (c431412)

Don't use local vars when finding missing characters (ab1210b)

Compress if statements (39ca9cf)

Use src_lang/tgt_lang for missing characters since it operates on the same examples (84ba0ac)

Enkidu93 force-pushed the identical_source_and_target_language_code branch from 5b28d16 to 84ba0ac Compare June 25, 2025 16:05

Enkidu93 merged commit d8aa497 into main Jun 25, 2025
13 of 14 checks passed

Enkidu93 deleted the identical_source_and_target_language_code branch June 25, 2025 16:15
