Skip to content

Make Argos and Azure Translation use SELECTED_CONTENT#383

Merged
brosenberg42 merged 30 commits into
developfrom
feat/json-path
Feb 21, 2025
Merged

Make Argos and Azure Translation use SELECTED_CONTENT#383
brosenberg42 merged 30 commits into
developfrom
feat/json-path

Conversation

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)


python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):
Update to:

Job did not contain a feed forward track or specify selected content.

Copy link
Copy Markdown
Member Author

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @jrobble)


python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Update to:

Job did not contain a feed forward track or specify selected content.

Done.

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

@jrobble jrobble changed the title Make argos use SELECTED_CONTENT Make Argos and Azure Translation use SELECTED_CONTENT Feb 11, 2025
Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 23 of 23 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @brosenberg42)


python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):
The language detection component in the other repo says the following to be more specific:

"Component can only process feed forward jobs for videos, "

However, in other places it just says:

"Component can only process feed forward jobs"

In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.


a discussion (no related file):
Per the reviewer guide:

  • When adding a new component, add it to the table in the Overview section in this github.io doc.

python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):

RUN python <<eot
import huggingface_hub
huggingface_hub.hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")

Have you tested this offline to make sure this works?


python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):

          "DETECTION_LANGUAGE_FASTTEXT"
        ],
        "properties": [

Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.

Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.

The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.

The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.


python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):

      "tasks": [
        "FASTTEXT LANGUAGE ID TEXT FILE TASK",
        "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) TASK"

Usually pipelines have task merging enabled.

I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.


python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

  respectively. This represents the top predicted language for any given text
  input. If no language is detected, `ISO_LANGUAGE` and `ISO_SCRIPT` will be set to
  `<UNKNOWN>`.

The language detection component in the other repo does this:

        if no_language_detected:
            detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE"
            self.processing_count += 1
            return

        report = "; ".join(
            f"lang: {r.language.lower()}-"
            f"{r.script.lower()}, "
            f"section: {r.section.start_index}-{r.section.end_index}, "
            f"conf: {r.confidence}"
            for r in report_list
        )
        if primary_language:
            detection_properties["ISO_LANGUAGE"] = primary_language
        if primary_script:
            detection_properties["ISO_SCRIPT"] = primary_script

Let's take a similar approach for the fast text component to be consistent.

Specifically, that component sets *_NO_LANGUAGE_DETECTION to true and omits the ISO_LANGUAGE and ISO_SCRIPT properties.

Alternatively, let's talk about creating a task to update the other component if you think your approach is better.


python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

[Classical Chinese](https://en.wikipedia.org/wiki/Classical_Chinese), an arcane form of Chinese
writing. When the model is provided with things like text with only whitespace or text with only
emoji, it will output `lzh_Hani` with a very high confidence.

Explain that the component reports unknown instead of lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)


python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):

*****************************************************************************

The GlotLID model is used by this component. It is licensed under the Apache 2.0 License.

Check line length.

Copy link
Copy Markdown
Member Author

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 24 files reviewed, 8 unresolved discussions (waiting on @jrobble)


python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Usually pipelines have task merging enabled.

I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.

Users will expect the tracks with the translation to be in TRANSLATION section of the output object. They will be confused if the translation is in the LANGUAGE section. Task merging doesn't make sense for this pipeline, so including it in the name just makes things confusing.


python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Have you tested this offline to make sure this works?

"This" is a line that only gets executed when running docker build. It does require network access. It downloads the model so that the component Docker image works offline.


python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Check line length.

Done.


python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

The language detection component in the other repo does this:

        if no_language_detected:
            detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE"
            self.processing_count += 1
            return

        report = "; ".join(
            f"lang: {r.language.lower()}-"
            f"{r.script.lower()}, "
            f"section: {r.section.start_index}-{r.section.end_index}, "
            f"conf: {r.confidence}"
            for r in report_list
        )
        if primary_language:
            detection_properties["ISO_LANGUAGE"] = primary_language
        if primary_script:
            detection_properties["ISO_SCRIPT"] = primary_script

Let's take a similar approach for the fast text component to be consistent.

Specifically, that component sets *_NO_LANGUAGE_DETECTION to true and omits the ISO_LANGUAGE and ISO_SCRIPT properties.

Alternatively, let's talk about creating a task to update the other component if you think your approach is better.

In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.


python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Explain that the component reports unknown instead of lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)

<UNKNOWN> is already mentioned in the readme.


python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

The language detection component in the other repo says the following to be more specific:

"Component can only process feed forward jobs for videos, "

However, in other places it just says:

"Component can only process feed forward jobs"

In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.

Done.


python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.

Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.

The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.

The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.

There is no length limit. I have not seen a case where a human could determine the language, but the component couldn't. It is not possible to determine the language of very short strings like "abc" or "animal" (it is the same in Spanish).

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @brosenberg42)


python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, brosenberg42 wrote…

In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.

Please look into how downstream translation components handle <UNKNOWN>. I agree that MYCOMPONENT_* should be removed, but if we go with the <UNKNOWN> approach that property is not necessary. Our goal is to make sure the language detection components are consistent.


python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, brosenberg42 wrote…

<UNKNOWN> is already mentioned in the readme.

Please mention unknown again here.

Copy link
Copy Markdown
Member Author

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 23 of 24 files reviewed, 3 unresolved discussions (waiting on @jrobble)


a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Per the reviewer guide:

  • When adding a new component, add it to the table in the Overview section in this github.io doc.

Done.


python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please look into how downstream translation components handle <UNKNOWN>. I agree that MYCOMPONENT_* should be removed, but if we go with the <UNKNOWN> approach that property is not necessary. Our goal is to make sure the language detection components are consistent.

Argos produces this error: An error occurred while invoking the "get_detections_from_generic" method on the Python component: Source language, <unknown>, is not supported. (DetectionError.DETECTION_FAILED)


python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please mention unknown again here.

Done.

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)


python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):

    @classmethod
    def unknown(cls):

This should also return NO_LANGUAGE_DETECTION to be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).

I believe the original reason for including NO_LANGUAGE_DETECTION was to make it easier for consumers so that they just need to check for a true value there vs. an UNKNOWN string. In general, the consumers are other components via feed-forward and we're already checking for <UNKNOWN>, so it can be removed.

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @brosenberg42)


python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):

            if job_feed_forward is None:
                raise mpf.DetectionError.UNSUPPORTED_DATA_TYPE.exception(
                    f'Component can only process feed forward '

There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.

Instead of fixing that please update this component to use fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.

Copy link
Copy Markdown
Member Author

@brosenberg42 brosenberg42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 19 of 24 files reviewed, 2 unresolved discussions (waiting on @jrobble)


python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):

Previously, jrobble (Jeff Robble) wrote…

There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.

Instead of fixing that please update this component to use fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.

Done.


python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This should also return NO_LANGUAGE_DETECTION to be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).

I believe the original reason for including NO_LANGUAGE_DETECTION was to make it easier for consumers so that they just need to check for a true value there vs. an UNKNOWN string. In general, the consumers are other components via feed-forward and we're already checking for <UNKNOWN>, so it can be removed.

I removed NO_LANGUAGE_DETECTION

Copy link
Copy Markdown
Member

@jrobble jrobble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

@brosenberg42 brosenberg42 merged commit 1d91c66 into develop Feb 21, 2025
@brosenberg42 brosenberg42 deleted the feat/json-path branch February 21, 2025 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants