Make Argos and Azure Translation use SELECTED_CONTENT by brosenberg42 · Pull Request #383 · openmpf/openmpf-components

brosenberg42 · 2024-09-20T17:53:29Z

Issues:

Allow users to use JSON Path to specify which parts of a JSON document to process. openmpf#1851
Create a language identification component using FastText openmpf#1862

Related PRs:

This change is

This reverts commit e059be5.

jrobble

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)

python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):
Update to:

Job did not contain a feed forward track or specify selected content.

brosenberg42

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @jrobble)

python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Update to:

Job did not contain a feed forward track or specify selected content.

Done.

jrobble

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

jrobble

Reviewed 23 of 23 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @brosenberg42)

python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):
The language detection component in the other repo says the following to be more specific:

"Component can only process feed forward jobs for videos, "

However, in other places it just says:

"Component can only process feed forward jobs"

In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.

a discussion (no related file):
Per the reviewer guide:

When adding a new component, add it to the table in the Overview section in this github.io doc.

python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):

RUN python <<eot
import huggingface_hub
huggingface_hub.hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")

Have you tested this offline to make sure this works?

python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):

          "DETECTION_LANGUAGE_FASTTEXT"
        ],
        "properties": [

Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.

Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.

The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.

The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.

python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):

      "tasks": [
        "FASTTEXT LANGUAGE ID TEXT FILE TASK",
        "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) TASK"

Usually pipelines have task merging enabled.

I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.

python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

  respectively. This represents the top predicted language for any given text
  input. If no language is detected, `ISO_LANGUAGE` and `ISO_SCRIPT` will be set to
  `<UNKNOWN>`.

The language detection component in the other repo does this:

        if no_language_detected:
            detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE"
            self.processing_count += 1
            return

        report = "; ".join(
            f"lang: {r.language.lower()}-"
            f"{r.script.lower()}, "
            f"section: {r.section.start_index}-{r.section.end_index}, "
            f"conf: {r.confidence}"
            for r in report_list
        )
        if primary_language:
            detection_properties["ISO_LANGUAGE"] = primary_language
        if primary_script:
            detection_properties["ISO_SCRIPT"] = primary_script

Let's take a similar approach for the fast text component to be consistent.

Specifically, that component sets *_NO_LANGUAGE_DETECTION to true and omits the ISO_LANGUAGE and ISO_SCRIPT properties.

Alternatively, let's talk about creating a task to update the other component if you think your approach is better.

python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

[Classical Chinese](https://en.wikipedia.org/wiki/Classical_Chinese), an arcane form of Chinese
writing. When the model is provided with things like text with only whitespace or text with only
emoji, it will output `lzh_Hani` with a very high confidence.

Explain that the component reports unknown instead of lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)

python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):

*****************************************************************************

The GlotLID model is used by this component. It is licensed under the Apache 2.0 License.

Check line length.

brosenberg42

Reviewable status: 13 of 24 files reviewed, 8 unresolved discussions (waiting on @jrobble)

python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Usually pipelines have task merging enabled.

I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.

Users will expect the tracks with the translation to be in TRANSLATION section of the output object. They will be confused if the translation is in the LANGUAGE section. Task merging doesn't make sense for this pipeline, so including it in the name just makes things confusing.

python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Have you tested this offline to make sure this works?

"This" is a line that only gets executed when running docker build. It does require network access. It downloads the model so that the component Docker image works offline.

python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Check line length.

Done.

python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

The language detection component in the other repo does this:

        if no_language_detected:
            detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE"
            self.processing_count += 1
            return

        report = "; ".join(
            f"lang: {r.language.lower()}-"
            f"{r.script.lower()}, "
            f"section: {r.section.start_index}-{r.section.end_index}, "
            f"conf: {r.confidence}"
            for r in report_list
        )
        if primary_language:
            detection_properties["ISO_LANGUAGE"] = primary_language
        if primary_script:
            detection_properties["ISO_SCRIPT"] = primary_script

Let's take a similar approach for the fast text component to be consistent.

Specifically, that component sets *_NO_LANGUAGE_DETECTION to true and omits the ISO_LANGUAGE and ISO_SCRIPT properties.

Alternatively, let's talk about creating a task to update the other component if you think your approach is better.

In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.

python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Explain that the component reports unknown instead of lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)

<UNKNOWN> is already mentioned in the readme.

python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

The language detection component in the other repo says the following to be more specific:

"Component can only process feed forward jobs for videos, "

However, in other places it just says:

"Component can only process feed forward jobs"

In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.

Done.

python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.

Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.

The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.

The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.

There is no length limit. I have not seen a case where a human could determine the language, but the component couldn't. It is not possible to determine the language of very short strings like "abc" or "animal" (it is the same in Spanish).

jrobble

Reviewed 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @brosenberg42)

python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, brosenberg42 wrote…

In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.

Please look into how downstream translation components handle <UNKNOWN>. I agree that MYCOMPONENT_* should be removed, but if we go with the <UNKNOWN> approach that property is not necessary. Our goal is to make sure the language detection components are consistent.

python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, brosenberg42 wrote…

<UNKNOWN> is already mentioned in the readme.

Please mention unknown again here.

brosenberg42

Reviewable status: 23 of 24 files reviewed, 3 unresolved discussions (waiting on @jrobble)

a discussion (no related file):

Previously, jrobble (Jeff Robble) wrote…

Per the reviewer guide:

When adding a new component, add it to the table in the Overview section in this github.io doc.

Done.

python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please look into how downstream translation components handle <UNKNOWN>. I agree that MYCOMPONENT_* should be removed, but if we go with the <UNKNOWN> approach that property is not necessary. Our goal is to make sure the language detection components are consistent.

Argos produces this error: An error occurred while invoking the "get_detections_from_generic" method on the Python component: Source language, <unknown>, is not supported. (DetectionError.DETECTION_FAILED)

python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):

Previously, jrobble (Jeff Robble) wrote…

Please mention unknown again here.

Done.

jrobble

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)

python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):

    @classmethod
    def unknown(cls):

This should also return NO_LANGUAGE_DETECTION to be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).

I believe the original reason for including NO_LANGUAGE_DETECTION was to make it easier for consumers so that they just need to check for a true value there vs. an UNKNOWN string. In general, the consumers are other components via feed-forward and we're already checking for <UNKNOWN>, so it can be removed.

jrobble

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @brosenberg42)

python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):

            if job_feed_forward is None:
                raise mpf.DetectionError.UNSUPPORTED_DATA_TYPE.exception(
                    f'Component can only process feed forward '

There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.

Instead of fixing that please update this component to use fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.

brosenberg42

Reviewable status: 19 of 24 files reviewed, 2 unresolved discussions (waiting on @jrobble)

python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):

Previously, jrobble (Jeff Robble) wrote…

There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.

Instead of fixing that please update this component to use fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.

Done.

python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):

Previously, jrobble (Jeff Robble) wrote…

This should also return NO_LANGUAGE_DETECTION to be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).

I believe the original reason for including NO_LANGUAGE_DETECTION was to make it easier for consumers so that they just need to check for a true value there vs. an UNKNOWN string. In general, the consumers are other components via feed-forward and we're already checking for <UNKNOWN>, so it can be removed.

I removed NO_LANGUAGE_DETECTION

jrobble

Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)

brosenberg42 added 6 commits July 1, 2024 12:19

Add OrToolsSubjectComponent

4656fdf

Merge branch 'develop' into feat/subject-tracking

0dac6f8

Merge branch 'develop' into feat/subject-tracking

48d3487

Make argos use SELECTED_CONTENT

e059be5

Revert "Make argos use SELECTED_CONTENT"

c103017

This reverts commit e059be5.

Make argos use SELECTED_CONTENT

473d686

brosenberg42 requested a review from jrobble September 20, 2024 17:53

brosenberg42 self-assigned this Sep 20, 2024

brosenberg42 mentioned this pull request Sep 20, 2024

Feat/json path openmpf/openmpf#1850

Merged

Address PR issues

ea2db03

jrobble requested changes Oct 10, 2024

View reviewed changes

Update log message

9b9ecd3

brosenberg42 commented Oct 11, 2024

View reviewed changes

jrobble reviewed Oct 16, 2024

View reviewed changes

brosenberg42 added 16 commits November 27, 2024 09:26

Convert subject executor to python

23af3da

Merge branch 'develop' into feat/subject-tracking

6cb5841

Add subject to subject relationships

68eb659

Add license info

d2b50a0

Merge branch 'develop' into feat/json-path

41b3c9d

Add fasttext lid

32be415

Merge branch 'develop' into feat/json-path

8a3ce42

Merge branch 'develop' into feat/json-path

2938e94

Merge branch 'develop' into feat/subject-tracking

eaae431

Add timing

44bf399

Improve performance

c8d75b1

Strip when loading traditional characters

285bf8e

Add FastText to Argos pipeline

18507af

Merge branch 'develop' into feat/fasttext-lid

9ff41dd

Merge branch 'feat/subject-tracking' into test-json-merge

6f06157

Merge branch 'test-json-merge' into test-json-multi-stage-merge

d0bef6d

Add support for selectors to azure translation

f7e1bf6

This was referenced Feb 4, 2025

Feat/json path openmpf/openmpf-docker#234

Merged

Feat/json path openmpf/openmpf.github.io#190

Merged

Merge branch 'develop' into feat/json-path

bab8ca2

jrobble changed the title ~~Make argos use SELECTED_CONTENT~~ Make Argos and Azure Translation use SELECTED_CONTENT Feb 11, 2025

jrobble requested changes Feb 13, 2025

View reviewed changes

Address PR issues

c6848fc

brosenberg42 commented Feb 14, 2025

View reviewed changes

jrobble requested changes Feb 14, 2025

View reviewed changes

Update README

7f992f0

brosenberg42 commented Feb 17, 2025

View reviewed changes

jrobble requested changes Feb 18, 2025

View reviewed changes

brosenberg42 added 2 commits February 19, 2025 08:14

Remove angle brackets from unknown

75d7f12

Update translation ff error messages

91e63d3

brosenberg42 commented Feb 19, 2025

View reviewed changes

jrobble approved these changes Feb 20, 2025

View reviewed changes

brosenberg42 merged commit 1d91c66 into develop Feb 21, 2025

brosenberg42 deleted the feat/json-path branch February 21, 2025 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Argos and Azure Translation use SELECTED_CONTENT#383

Make Argos and Azure Translation use SELECTED_CONTENT#383
brosenberg42 merged 30 commits into
developfrom
feat/json-path

brosenberg42 commented Sep 20, 2024 •

edited

Loading

Uh oh!

jrobble left a comment

Uh oh!

brosenberg42 left a comment

Uh oh!

jrobble left a comment

Uh oh!

jrobble left a comment

Uh oh!

brosenberg42 left a comment

Uh oh!

jrobble left a comment

Uh oh!

brosenberg42 left a comment

Uh oh!

jrobble left a comment

Uh oh!

jrobble left a comment

Uh oh!

brosenberg42 left a comment

Uh oh!

jrobble left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brosenberg42 commented Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

brosenberg42 left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

brosenberg42 left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

brosenberg42 left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

brosenberg42 left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brosenberg42 commented Sep 20, 2024 •

edited

Loading