Make Argos and Azure Translation use SELECTED_CONTENT#383
Conversation
This reverts commit e059be5.
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)
python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):
Update to:
Job did not contain a feed forward track or specify selected content.
brosenberg42
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @jrobble)
python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 69 at r1 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Update to:
Job did not contain a feed forward track or specify selected content.
Done.
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 23 of 23 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @brosenberg42)
python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):
The language detection component in the other repo says the following to be more specific:
"Component can only process feed forward jobs for videos, "
However, in other places it just says:
"Component can only process feed forward jobs"
In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.
a discussion (no related file):
Per the reviewer guide:
- When adding a new component, add it to the table in the Overview section in this github.io doc.
python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):
RUN python <<eot import huggingface_hub huggingface_hub.hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
Have you tested this offline to make sure this works?
python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):
"DETECTION_LANGUAGE_FASTTEXT" ], "properties": [
Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.
Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.
The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.
The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.
python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):
"tasks": [ "FASTTEXT LANGUAGE ID TEXT FILE TASK", "ARGOS TRANSLATION (WITH FF REGION AND NO TASK MERGING) TASK"
Usually pipelines have task merging enabled.
I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.
python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):
respectively. This represents the top predicted language for any given text input. If no language is detected, `ISO_LANGUAGE` and `ISO_SCRIPT` will be set to `<UNKNOWN>`.
The language detection component in the other repo does this:
if no_language_detected:
detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE"
self.processing_count += 1
return
report = "; ".join(
f"lang: {r.language.lower()}-"
f"{r.script.lower()}, "
f"section: {r.section.start_index}-{r.section.end_index}, "
f"conf: {r.confidence}"
for r in report_list
)
if primary_language:
detection_properties["ISO_LANGUAGE"] = primary_language
if primary_script:
detection_properties["ISO_SCRIPT"] = primary_scriptLet's take a similar approach for the fast text component to be consistent.
Specifically, that component sets *_NO_LANGUAGE_DETECTION to true and omits the ISO_LANGUAGE and ISO_SCRIPT properties.
Alternatively, let's talk about creating a task to update the other component if you think your approach is better.
python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):
[Classical Chinese](https://en.wikipedia.org/wiki/Classical_Chinese), an arcane form of Chinese writing. When the model is provided with things like text with only whitespace or text with only emoji, it will output `lzh_Hani` with a very high confidence.
Explain that the component reports unknown instead of lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)
python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):
***************************************************************************** The GlotLID model is used by this component. It is licensed under the Apache 2.0 License.
Check line length.
brosenberg42
left a comment
There was a problem hiding this comment.
Reviewable status: 13 of 24 files reviewed, 8 unresolved discussions (waiting on @jrobble)
python/ArgosTranslation/plugin-files/descriptor/descriptor.json line 131 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Usually pipelines have task merging enabled.
I think we should either enable it here or change the name of the pipeline to "(WITH FASTTEXT LANGUAGE ID AND NO TASK MERGING)" for clarity. My preference would be to enable it. I see that you use this pipeline for a system test. Consider disabling task merging using a job property.
Users will expect the tracks with the translation to be in TRANSLATION section of the output object. They will be confused if the translation is in the LANGUAGE section. Task merging doesn't make sense for this pipeline, so including it in the name just makes things confusing.
python/FastTextLanguageDetection/Dockerfile line 37 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Have you tested this offline to make sure this works?
"This" is a line that only gets executed when running docker build. It does require network access. It downloads the model so that the component Docker image works offline.
python/FastTextLanguageDetection/LICENSE line 25 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Check line length.
Done.
python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
The language detection component in the other repo does this:
if no_language_detected: detection_properties["MYCOMPONENT_NO_LANGUAGE_DETECTION"] = "TRUE" self.processing_count += 1 return report = "; ".join( f"lang: {r.language.lower()}-" f"{r.script.lower()}, " f"section: {r.section.start_index}-{r.section.end_index}, " f"conf: {r.confidence}" for r in report_list ) if primary_language: detection_properties["ISO_LANGUAGE"] = primary_language if primary_script: detection_properties["ISO_SCRIPT"] = primary_scriptLet's take a similar approach for the fast text component to be consistent.
Specifically, that component sets
*_NO_LANGUAGE_DETECTIONto true and omits theISO_LANGUAGEandISO_SCRIPTproperties.Alternatively, let's talk about creating a task to update the other component if you think your approach is better.
In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.
python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Explain that the component reports unknown instead of
lzh_Hani. (Although, see my other comment about not reporting a language or script instead of unknown.)
<UNKNOWN> is already mentioned in the readme.
python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 82 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
The language detection component in the other repo says the following to be more specific:
"Component can only process feed forward jobs for videos, "
However, in other places it just says:
"Component can only process feed forward jobs"
In general , the latter is confusing because both that component and this Fast Text component can process text files in a single-stage pipeline. Please reword it in both components for clarity.
Done.
python/FastTextLanguageDetection/plugin-files/descriptor/descriptor.json line 23 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Does Fast Text / GlotLID have a text length threshold? I could not readily find anything online about a Fast Text / GlotLID limit.
Have you done any testing to determine the affect of text length on accuracy? If not, please run a few tests.
The NLLB component has a max length of 360 chars and uses the text splitter when over that. See here. Granted, that's a translation component and Fast Text is language detection, so it may be a non-issue.
The language detection component in the other repo uses the text splitter, but only because of limitations on the number of bytes that can be sent to the server. We're not using a server here, so there's no reason to use a text splitter for that reason alone.
There is no length limit. I have not seen a case where a human could determine the language, but the component couldn't. It is not possible to determine the language of very short strings like "abc" or "animal" (it is the same in Spanish).
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @brosenberg42)
python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):
Previously, brosenberg42 wrote…
In the way I did it, the job consumer only needs to check two fields. In the approach above, the user needs to check an additional field. Since the name of the component is in the field, a user needs to customize their code based on what component is doing the language detection. It prevents us from being able to use multiple language detection components interchangeably.
Please look into how downstream translation components handle <UNKNOWN>. I agree that MYCOMPONENT_* should be removed, but if we go with the <UNKNOWN> approach that property is not necessary. Our goal is to make sure the language detection components are consistent.
python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):
Previously, brosenberg42 wrote…
<UNKNOWN>is already mentioned in the readme.
Please mention unknown again here.
brosenberg42
left a comment
There was a problem hiding this comment.
Reviewable status: 23 of 24 files reviewed, 3 unresolved discussions (waiting on @jrobble)
a discussion (no related file):
Previously, jrobble (Jeff Robble) wrote…
Per the reviewer guide:
- When adding a new component, add it to the table in the Overview section in this github.io doc.
Done.
python/FastTextLanguageDetection/README.md line 21 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Please look into how downstream translation components handle
<UNKNOWN>. I agree thatMYCOMPONENT_*should be removed, but if we go with the<UNKNOWN>approach that property is not necessary. Our goal is to make sure the language detection components are consistent.
Argos produces this error: An error occurred while invoking the "get_detections_from_generic" method on the Python component: Source language, <unknown>, is not supported. (DetectionError.DETECTION_FAILED)
python/FastTextLanguageDetection/README.md line 43 at r3 (raw file):
Previously, jrobble (Jeff Robble) wrote…
Please mention unknown again here.
Done.
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @brosenberg42)
python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):
@classmethod def unknown(cls):
This should also return NO_LANGUAGE_DETECTION to be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).
I believe the original reason for including NO_LANGUAGE_DETECTION was to make it easier for consumers so that they just need to check for a true value there vs. an UNKNOWN string. In general, the consumers are other components via feed-forward and we're already checking for <UNKNOWN>, so it can be removed.
jrobble
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @brosenberg42)
python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):
if job_feed_forward is None: raise mpf.DetectionError.UNSUPPORTED_DATA_TYPE.exception( f'Component can only process feed forward '
There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.
Instead of fixing that please update this component to use fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.
brosenberg42
left a comment
There was a problem hiding this comment.
Reviewable status: 19 of 24 files reviewed, 2 unresolved discussions (waiting on @jrobble)
python/ArgosTranslation/argos_translation_component/argos_translation_component.py line 91 at r5 (raw file):
Previously, jrobble (Jeff Robble) wrote…
There are two spaces between "forward" and "jobs". Also, there is an extra space after the period.
Instead of fixing that please update this component to use
fail_when_missing_feed_forward(). There are a lot of components that do feed-forward which we could update as well, but for now let's just make sure the translation ones are consistent.
Done.
python/FastTextLanguageDetection/fast_text_language_detection_component/__init__.py line 247 at r5 (raw file):
Previously, jrobble (Jeff Robble) wrote…
This should also return
NO_LANGUAGE_DETECTIONto be consistent with the other component, or we can remove it from the other component (unless you think there's a good reason to keep it there but not include it here).I believe the original reason for including
NO_LANGUAGE_DETECTIONwas to make it easier for consumers so that they just need to check for a true value there vs. anUNKNOWNstring. In general, the consumers are other components via feed-forward and we're already checking for<UNKNOWN>, so it can be removed.
I removed NO_LANGUAGE_DETECTION
jrobble
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @brosenberg42)
Issues:
Related PRs:
This change is