@zoobereq
Contributor

@zoobereq zoobereq commented May 29, 2024

What does this PR do?

This PR implements DE TN fixes for the following issues:

  • Adds support for normalizing social media tags (e.g. @zoobereq and @zoobereq.net)
  • Adds support for normalizing comma-separated digit strings
  • Adds support for period-separated time formats (e.g. 2.30 and 02.30)
  • Fixes the issue where the final period of a sentence ending with a domain name was normalized as part of that domain.
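
As a rough illustration of three of the behaviors listed above, here is a plain-Python sketch. It is not the implementation in this PR: the DE grammars are exported through pynini (see tools/text_processing_deployment/pynini_export.py), whereas this sketch uses regular expressions and string operations only. The helper names `spell_out_handle`, `parse_period_time`, and `split_sentence_final_period` are hypothetical, and the letter-by-letter verbalization is assumed from the `at z u c k~@zuck` test case further down in this thread.

```python
import re

# Hypothetical helpers for illustration only; the real grammars are WFSTs.

# Hour 0-24 with an optional leading zero, a period, then two minute digits.
TIME_RE = re.compile(r"^(?P<hour>[01]?\d|2[0-4])\.(?P<minute>[0-5]\d)$")


def spell_out_handle(tag):
    """Spell a letters-only social media tag, e.g. '@abc' -> 'at a b c'."""
    if not tag.startswith("@") or not tag[1:].isalpha():
        raise ValueError(f"not a simple handle: {tag!r}")
    return "at " + " ".join(tag[1:])


def parse_period_time(token):
    """Return (hour, minute) for tokens like '2.30' or '02.30', else None."""
    match = TIME_RE.match(token)
    if match is None:
        return None
    return int(match.group("hour")), int(match.group("minute"))


def split_sentence_final_period(token):
    """Detach one trailing period from a token that already contains a period.

    The inner-period check is a crude stand-in for "looks like a domain".
    """
    if token.endswith(".") and "." in token[:-1]:
        return token[:-1], "."
    return token, ""


print(spell_out_handle("@abc"))
print(parse_period_time("02.30"))
print(split_sentence_final_period("www.example.de."))
```

A token such as `2.30` is ambiguous on its own, so a real grammar would presumably rely on context such as a following `Uhr`, as in the `00.00 Uhr` format named in the commit message.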

This PR does not address the following:

  • The issue with optionally period-separated cardinals (e.g. 1 Mil = 1000000 | 1.000.000), which the currently implemented DE TN/ITN doesn't support. The TN and ITN taggers have been rebuilt to accommodate this, but since cardinals plug into almost all other DE classes, the issue will be addressed gradually as other classes are fixed or (re)developed.
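
To make the cardinal issue concrete, the following plain-Python sketch accepts both written forms of the same number. It is an assumption-laden illustration rather than the rebuilt tagger: `CARDINAL_RE` and `parse_cardinal` are hypothetical names, and the sketch uses a regular expression instead of a pynini grammar.

```python
import re

# Either plain digits, or groups of three digits separated by periods
# (the German thousands separator), e.g. '1000000' or '1.000.000'.
CARDINAL_RE = re.compile(r"^(?:\d{1,3}(?:\.\d{3})+|\d+)$")


def parse_cardinal(token):
    """Return the integer value of a DE cardinal token, or None if malformed."""
    if CARDINAL_RE.match(token) is None:
        return None
    return int(token.replace(".", ""))


print(parse_cardinal("1000000"))
print(parse_cardinal("1.000.000"))
print(parse_cardinal("1.00"))
```

Because both spellings map to one value, every class that embeds a cardinal has to accept both, which is one way to read why the fix is being rolled out class by class.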

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

zoobereq and others added 7 commits May 29, 2024 14:55
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
…git strings

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
…00.00 Uhr or 0.00 Uhr

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
…ng with a domain name would be tagged as part of that domain name

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
zoobereq and others added 5 commits May 30, 2024 08:33
@zoobereq zoobereq requested review from ekmb and tbartley94 May 30, 2024 19:04
@zoobereq zoobereq self-assigned this May 30, 2024
@zoobereq zoobereq marked this pull request as ready for review May 30, 2024 19:05
@ekmb
Collaborator

ekmb commented May 30, 2024

@zoobereq please update grammars path in Jenkins to re-built CI grammars https://github.com/NVIDIA/NeMo-text-processing/blob/main/Jenkinsfile#L15

zoobereq and others added 3 commits June 3, 2024 09:58
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
w w w punkt a m a z o n punkt com punkt de .~www.amazon.com.de.
h t t p s doppelpunkt slash slash w w w punkt a b c punkt com slash a b fragezeichen gleichheitszeichen drei bindestrich slash a b s slash eins~https://www.abc.com/ab?=3-/abs/1
at z u c k~@zuck
at z o o b e r e q~@zoobereq
Don't use your name as an example. It leads to potential doxing.

Contributor Author

@zoobereq zoobereq Jun 3, 2024

Good point - thank you. Fixed.

@zoobereq zoobereq changed the title DE_TN_Fixes DE TN Fixes Jun 4, 2024
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
tbartley94 previously approved these changes Jun 4, 2024
Member

@tbartley94 tbartley94 left a comment

Remove references to yourself and competitors and LGTM

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
@tbartley94
Member

@zoobereq LGTM, will approve once Evelina is happy

Collaborator

@ekmb ekmb left a comment

Thanks!

@ekmb ekmb merged commit e2fbc45 into main Jun 6, 2024
@ekmb ekmb deleted the DE_TN_Fixes branch June 10, 2024 19:02
BuyuanCui pushed a commit that referenced this pull request Jul 12, 2024
* Adds support for social media tags (e.g. @zoobereq)

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Adds test cases for social media tags

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Fixes pathing for Sparrowhawk

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Fixes the issue of the DE normalizer not accepting comma-separated digit strings

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Fixes the issue where the normalizer didn't accept time formatted as 00.00 Uhr or 0.00 Uhr

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Fixes the issue where the sentence-final period in sentences ending with a domain name would be tagged as part of that domain name

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes the formatting

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes #166 for DE

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updates grammar paths

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Minor Fixes

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Fixes test cases

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

---------

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Alex Cui <alexcui1994@gmail.com>
BuyuanCui pushed a commit that referenced this pull request Aug 20, 2024
BuyuanCui pushed a commit that referenced this pull request Sep 19, 2024
BuyuanCui pushed a commit that referenced this pull request Sep 26, 2024
BuyuanCui pushed a commit that referenced this pull request Oct 16, 2024
ngachchi pushed a commit to ngachchi/NeMo-text-processing that referenced this pull request Jun 23, 2025
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
FredHaa pushed a commit to FredHaa/NeMo-text-processing that referenced this pull request Aug 15, 2025
4 participants