Conversation

@Enkidu93
Collaborator

@Enkidu93 Enkidu93 commented Jun 23, 2025

Can't we just always pass "src"/"trg" here?

Fixes sillsdev/serval#707

The issue was here:

example[translation_column] = {source_lang: row.source_text, target_lang: row.target_text}

When source_lang and target_lang are equal, both entries share the same key, so the dictionary literal collapses to a single item (the later value, row.target_text, overwrites the earlier one).
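
A minimal standalone illustration of that collapse, using plain strings in place of the real row object (the values here are made up for the example):

```python
# Both keys evaluate to the same string, so the dict keeps one entry
# and the later value overwrites the earlier one.
source_lang = "en"
target_lang = "en"

example = {source_lang: "source text", target_lang: "target text"}

print(example)       # {'en': 'target text'}
print(len(example))  # 1
```

The source text is silently lost, so any later lookup by language code can only ever return the target text.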

I have not yet tested the complete pipeline, but I plan to publish a development Docker image and then test end to end through a local Serval instance.



@Enkidu93 Enkidu93 requested review from ddaspit and mshannon-sil June 23, 2025 20:37
Contributor

@ddaspit ddaspit left a comment


Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):

        else:
            if src_lang == tgt_lang:
                train_dataset = self._corpus.filter_nonempty().to_hf_dataset("src", "trg")

Could we add a _src and _trg suffix to the language code if they are the same?
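
One way the suffix idea could look, as a rough sketch only. The helper name dataset_keys is hypothetical and is not taken from the PR; it just shows how distinct dataset keys can be derived when the two codes match:

```python
from typing import Tuple


def dataset_keys(src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    """Return the keys to use for the source and target text in the dataset.

    Hypothetical helper: when the codes are identical, suffixes keep the two
    keys distinct so neither side overwrites the other.
    """
    if src_lang == tgt_lang:
        return f"{src_lang}_src", f"{tgt_lang}_trg"
    return src_lang, tgt_lang


print(dataset_keys("en", "fr"))  # ('en', 'fr')
print(dataset_keys("en", "en"))  # ('en_src', 'en_trg')
```

Compared with fixed "src"/"trg" keys, this keeps the language code visible in the key while still guaranteeing uniqueness.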


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

                tokenizer.backend_tokenizer.normalizer = norm_tok.backend_tokenizer.normalizer  # type: ignore
                if self._add_unk_src_tokens and self._add_unk_tgt_tokens:
                    lang_codes = [src_lang, tgt_lang]

We should use self._src_lang and self._tgt_lang here. The local variables should only be used when retrieving data from the dataset.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):

        def preprocess_function(examples):
            if isinstance(tokenizer, (NllbTokenizer, NllbTokenizerFast)):
                inputs = [self._mpn.normalize(prefix + ex[src_lang]) for ex in examples["translation"]]

src_lang and tgt_lang should be set to the new values that are used above, so that the correct codes are used here.
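
To make the concern concrete, here is a small sketch with a hand-built list of rows standing in for examples["translation"]. The suffixed keys and sample text are assumptions for illustration, not the PR's actual data:

```python
# Rows as they might look once the dataset is built with distinct keys.
rows = [{"en_src": "hello", "en_trg": "hi"}]

# The local variables must hold the keys the dataset was built with.
src_lang, tgt_lang = "en_src", "en_trg"

inputs = [ex[src_lang] for ex in rows]
targets = [ex[tgt_lang] for ex in rows]
print(inputs, targets)  # ['hello'] ['hi']

# Looking rows up by the original, unsuffixed code fails.
try:
    [ex["en"] for ex in rows]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True
```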

Collaborator Author

@Enkidu93 Enkidu93 left a comment


Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 162 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Could we add a _src and _trg suffix to the language code if they are the same?

Sure! I was just following what was in silnlp.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should use self._src_lang and self._tgt_lang here. The local variables should only be used when retrieving data from the dataset.

Yes, that simplifies it.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 290 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

src_lang and tgt_lang should be set to the new values that are used above, so that the correct codes are used here.

Oh, yes, I'm silly. I had that before, and then switched it because it was changing the codes when finding missing characters. Thank you!

@codecov-commenter

codecov-commenter commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.91%. Comparing base (b07fcb6) to head (84ba0ac).

Files with missing lines Patch % Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py 75.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #199      +/-   ##
==========================================
- Coverage   88.92%   88.91%   -0.01%     
==========================================
  Files         282      282              
  Lines       17053    17056       +3     
==========================================
+ Hits        15165    15166       +1     
- Misses       1888     1890       +2     


Contributor

@ddaspit ddaspit left a comment


Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Yes, that simplifies it.

I think this can be simplified further. Something like this:

if self._add_unk_src_tokens and self._src_lang is not None:
    lang_codes.append(self._src_lang)
if self._add_unk_tgt_tokens and self._tgt_lang is not None:
    lang_codes.append(self._tgt_lang)
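
For reference, the same logic as a standalone function that can be run outside the trainer. The name collect_lang_codes and its parameters are hypothetical stand-ins for the instance attributes, and the sketch assumes lang_codes starts as an empty list:

```python
from typing import List, Optional


def collect_lang_codes(
    add_unk_src_tokens: bool,
    add_unk_tgt_tokens: bool,
    src_lang: Optional[str],
    tgt_lang: Optional[str],
) -> List[str]:
    """Gather the language codes for whichever sides are enabled and set."""
    lang_codes: List[str] = []
    if add_unk_src_tokens and src_lang is not None:
        lang_codes.append(src_lang)
    if add_unk_tgt_tokens and tgt_lang is not None:
        lang_codes.append(tgt_lang)
    return lang_codes


print(collect_lang_codes(True, False, "en", "fr"))  # ['en']
print(collect_lang_codes(True, True, None, "fr"))   # ['fr']
```

Each side is handled independently, so enabling only one flag or leaving one code unset no longer needs a separate branch.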

Collaborator Author

@Enkidu93 Enkidu93 left a comment


Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 214 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this can be simplified further. Something like this:

if self._add_unk_src_tokens and self._src_lang is not None:
    lang_codes.append(self._src_lang)
if self._add_unk_tgt_tokens and self._tgt_lang is not None:
    lang_codes.append(self._tgt_lang)

Done.

Contributor

@ddaspit ddaspit left a comment


Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

Contributor

@ddaspit ddaspit left a comment


:lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@Enkidu93 Enkidu93 force-pushed the identical_source_and_target_language_code branch from 5b28d16 to 84ba0ac Compare June 25, 2025 16:05
@Enkidu93 Enkidu93 merged commit d8aa497 into main Jun 25, 2025
13 of 14 checks passed
@Enkidu93 Enkidu93 deleted the identical_source_and_target_language_code branch June 25, 2025 16:15
Development

Successfully merging this pull request may close these issues.

Identical source and target language codes yields poor results

4 participants