Merged
98 commits
f3555a8
IT TN improvement on tests (#120)
mgrafu Oct 26, 2023
e67e17c
add single letter exception for roman numerals (#121)
mgrafu Oct 27, 2023
c2b9e0a
fix broken path for nondet whitelist (#124)
mgrafu Nov 3, 2023
41b21e2
Increase weights for serial (en TN) (#128)
anand-nv Nov 21, 2023
230b21e
add measures file for FR TN (#131)
mgrafu Dec 8, 2023
c4f4553
Sh jenkins (#127)
anand-nv Jan 19, 2024
5561c48
update isort - fix precommit (#138)
ekmb Feb 14, 2024
d9f749e
Armenian itn (#136)
davidks13 Feb 15, 2024
7bc3654
Fix CI (#142)
ekmb Feb 29, 2024
bf43b19
Armenian TN (#137)
davidks13 Mar 13, 2024
462d551
Marathi ITN (#134)
ChinmayPatil11 Mar 13, 2024
28409b2
jenkins fix (#150)
tbartley94 Mar 13, 2024
0032b2b
r0.3.0 release (#151)
ekmb Mar 13, 2024
5bbab9c
Fix text=line[text] to text=line[text_field] (#153)
ssh-meister Mar 19, 2024
86b1904
use real string on docstring (#157)
kevsan4 Mar 30, 2024
76f415c
Sh postprocess (#147)
anand-nv Apr 16, 2024
dea0439
update run_evaluate script for cased itn (#164)
mgrafu Apr 25, 2024
400c9fb
remove unused function from ar tn decimals (#165)
mgrafu Apr 25, 2024
36fa3af
ZH sentence-level TN (#112)
BuyuanCui Apr 30, 2024
ec331da
preparing release, updating change log (#168)
tbartley94 May 3, 2024
61054ba
hotfix (#169)
ekmb May 3, 2024
498781f
hotfix (#170)
tbartley94 May 3, 2024
d290748
DE TN Fixes (#177)
zoobereq Jun 6, 2024
df22a18
Tts en tech terms (#167)
mgrafu Jun 7, 2024
1318648
Normalizes the '%' sign (#180)
zoobereq Jun 7, 2024
7783a1c
FR TN Fixes (#181)
zoobereq Jun 7, 2024
6c19ae5
Merge branch 'main' of https://github.com/NVIDIA/NeMo-text-processing…
BuyuanCui Jul 12, 2024
85a771b
testing
BuyuanCui Jul 12, 2024
5bb4872
removing test.txt
BuyuanCui Jul 12, 2024
72341b1
fixing zh tn money curreny on l
BuyuanCui Jul 15, 2024
14ff392
bug fix on money currency l
BuyuanCui Jul 15, 2024
694c33b
updates for zh tn
BuyuanCui Jul 15, 2024
3152b35
resolving failed ci tests for money grammar
BuyuanCui Jul 16, 2024
d619eca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2024
f7416a1
updates for decimal maoney failure
BuyuanCui Jul 17, 2024
b5e6f33
removing comments
BuyuanCui Jul 17, 2024
ae47069
Merge branch 'zh_tn_bug_240712' of https://github.com/NVIDIA/NeMo-tex…
BuyuanCui Jul 17, 2024
682988c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 17, 2024
840ae1f
updates on money grammar for failure cases
BuyuanCui Jul 17, 2024
e7284e6
adding test cases in the nvbug
BuyuanCui Jul 17, 2024
bff16e0
Merge branch 'zh_tn_bug_240712' of https://github.com/NVIDIA/NeMo-tex…
BuyuanCui Jul 17, 2024
83ba1d7
updates for ci etst
BuyuanCui Jul 17, 2024
9ebb5ad
updating date for rerun
BuyuanCui Jul 17, 2024
50546b2
renaming final graphs
BuyuanCui Jul 17, 2024
9648811
Merge branch 'main' into zh_tn_bug_240712
BuyuanCui Jul 18, 2024
4219965
resolving conflicts
BuyuanCui Jul 18, 2024
63fec6f
conflicts
BuyuanCui Jul 18, 2024
17a3554
Merge branch 'zh_tn_bug_240712' of https://github.com/NVIDIA/NeMo-tex…
BuyuanCui Jul 18, 2024
f461b40
updating data
BuyuanCui Jul 18, 2024
16cb041
attempt to resolve jenkins issue
BuyuanCui Jul 18, 2024
5548a95
ci tests resolving
BuyuanCui Jul 18, 2024
739ef30
testing
BuyuanCui Jul 12, 2024
4431f6a
removing test.txt
BuyuanCui Jul 12, 2024
c1c7ef4
fixing zh tn money curreny on l
BuyuanCui Jul 15, 2024
f1d1d96
bug fix on money currency l
BuyuanCui Jul 15, 2024
f520f57
resolving failed ci tests for money grammar
BuyuanCui Jul 16, 2024
818dca0
updates for decimal maoney failure
BuyuanCui Jul 17, 2024
078fb7a
removing comments
BuyuanCui Jul 17, 2024
5087cd0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2024
66f2787
updates on money grammar for failure cases
BuyuanCui Jul 17, 2024
85c6ed5
adding test cases in the nvbug
BuyuanCui Jul 17, 2024
bea17ed
renaming final graphs
BuyuanCui Jul 17, 2024
5e26452
conflicts
BuyuanCui Jul 18, 2024
a7c8b6d
updating data
BuyuanCui Jul 18, 2024
7a64e69
attempt to resolve jenkins issue
BuyuanCui Jul 18, 2024
483f667
ci tests resolving
BuyuanCui Jul 18, 2024
47016b6
resolving conflict for ci tests update
BuyuanCui Jul 24, 2024
af3f8f7
Increase weights for serial (en TN) (#128)
anand-nv Nov 21, 2023
9a68cf9
add measures file for FR TN (#131)
mgrafu Dec 8, 2023
78aadbe
Sh jenkins (#127)
anand-nv Jan 19, 2024
4f9da16
update isort - fix precommit (#138)
ekmb Feb 14, 2024
02fae02
Armenian itn (#136)
davidks13 Feb 15, 2024
e9f32a8
Fix CI (#142)
ekmb Feb 29, 2024
f0fd38a
Armenian TN (#137)
davidks13 Mar 13, 2024
b0d58dd
Marathi ITN (#134)
ChinmayPatil11 Mar 13, 2024
a514d80
jenkins fix (#150)
tbartley94 Mar 13, 2024
66cda82
ZH sentence-level TN (#112)
BuyuanCui Apr 30, 2024
128753d
Tts en tech terms (#167)
mgrafu Jun 7, 2024
1d37427
testing
BuyuanCui Jul 12, 2024
150d864
removing test.txt
BuyuanCui Jul 12, 2024
7d0b513
fixing zh tn money curreny on l
BuyuanCui Jul 15, 2024
a1272f4
resolving failed ci tests for money grammar
BuyuanCui Jul 16, 2024
6314efb
updates for decimal maoney failure
BuyuanCui Jul 17, 2024
61b892b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2024
f8fb143
updates on money grammar for failure cases
BuyuanCui Jul 17, 2024
553d32e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 17, 2024
8002a50
renaming final graphs
BuyuanCui Jul 17, 2024
7b938de
conflicts
BuyuanCui Jul 18, 2024
0eca73d
updating data
BuyuanCui Jul 18, 2024
8e3a793
attempt to resolve jenkins issue
BuyuanCui Jul 18, 2024
2a48144
ci tests resolving
BuyuanCui Jul 18, 2024
7f451bf
committing
BuyuanCui Jul 25, 2024
6351b49
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 25, 2024
49d0058
resolving conflict
BuyuanCui Jul 26, 2024
ade0b91
Jenkins test not starting, copied form main branch
BuyuanCui Jul 26, 2024
4e149fa
copied from Nemo main, esolving Jenkins isue
BuyuanCui Jul 29, 2024
17ccdaa
copied from NeMo main, resolving Jenkins issue
BuyuanCui Jul 29, 2024
4323398
Merge branch 'main' into zh_tn_bug_240712
BuyuanCui Aug 15, 2024
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -476,4 +476,4 @@ pipeline {
 cleanWs()
 }
 }
-}
+}
@@ -1,32 +1,32 @@
-# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import pynini
-from pynini.lib import pynutil
-from nemo_text_processing.text_normalization.zh.graph_utils import NEMO_NOT_QUOTE, GraphFst, delete_space
-class WordFst(GraphFst):
-    '''
-    tokens { char: "一" } -> 一
-    '''
-    def __init__(self, deterministic: bool = True, lm: bool = False):
-        super().__init__(name="char", kind="verbalize", deterministic=deterministic)
-        graph = pynutil.delete("name: \"") + NEMO_NOT_QUOTE + pynutil.delete("\"")
-        graph = pynini.closure(delete_space) + graph + pynini.closure(delete_space)
-        self.fst = graph.optimize()
+# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import pynini
+from pynini.lib import pynutil
+
+from nemo_text_processing.text_normalization.zh.graph_utils import NEMO_NOT_QUOTE, GraphFst, delete_space
+
+
+class WordFst(GraphFst):
+    '''
+    tokens { char: "一" } -> 一
+    '''
+
+    def __init__(self, deterministic: bool = True, lm: bool = False):
+        super().__init__(name="char", kind="verbalize", deterministic=deterministic)
+
+        graph = pynutil.delete("name: \"") + NEMO_NOT_QUOTE + pynutil.delete("\"")
+        graph = pynini.closure(delete_space) + graph + pynini.closure(delete_space)
+        self.fst = graph.optimize()
14 changes: 14 additions & 0 deletions nemo_text_processing/text_normalization/en/taggers/electronic.py
@@ -52,6 +52,8 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True):

         cc_cues = pynutil.add_weight(pynini.string_file(get_abs_path("data/electronic/cc_cues.tsv")), MIN_NEG_WEIGHT,)

+        cc_cues = pynutil.add_weight(pynini.string_file(get_abs_path("data/electronic/cc_cues.tsv")), MIN_NEG_WEIGHT)
+
         accepted_symbols = pynini.project(pynini.string_file(get_abs_path("data/electronic/symbol.tsv")), "input")
         accepted_common_domains = pynini.project(
             pynini.string_file(get_abs_path("data/electronic/domain.tsv")), "input"
@@ -135,6 +137,18 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True):
             )
             graph |= cc_phrases

+        if deterministic:
+            # credit card cues
+            numbers = pynini.closure(NEMO_DIGIT, 4, 16)
+            cc_phrases = (
+                pynutil.insert("protocol: \"")
+                + cc_cues
+                + pynutil.insert("\" domain: \"")
+                + numbers
+                + pynutil.insert("\"")
+            )
+            graph |= cc_phrases
+
         final_graph = self.add_tokens(graph)

         self.fst = final_graph.optimize()
13 changes: 6 additions & 7 deletions nemo_text_processing/text_normalization/zh/taggers/money.py
@@ -19,7 +19,6 @@
 from nemo_text_processing.text_normalization.zh.graph_utils import GraphFst
 from nemo_text_processing.text_normalization.zh.utils import get_abs_path

-# def get_quantity(decimal):
 suffix = pynini.union(
     "万",
     "十万",
@@ -107,7 +106,7 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True, lm: bool = Fa
         # larger money as decimals
         graph_decimal = (
             pynutil.insert('integer_part: \"')
-            + pynini.closure(
+            + (
                 pynini.closure(cardinal, 1)
                 + pynutil.delete('.')
                 + pynutil.insert('点')
@@ -117,14 +116,16 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True, lm: bool = Fa
         )
         graph_decimal_money = (
             pynini.closure(graph_decimal, 1)
-            + pynini.closure(pynutil.insert(' quantity: \"') + suffix + pynutil.insert('\"'))
+            + pynini.closure((pynutil.insert(' quantity: \"') + suffix + pynutil.insert('\"')), 0, 1)
             + pynutil.insert(" ")
             + pynini.closure(currency_mandarin_component, 1)
         ) | (
             pynini.closure(currency_component, 1)
             + pynutil.insert(" ")
             + pynini.closure(graph_decimal, 1)
-            + pynini.closure(pynutil.insert(" ") + pynutil.insert('quantity: \"') + suffix + pynutil.insert('\"'))
+            + pynini.closure(
+                (pynutil.insert(" ") + pynutil.insert('quantity: \"') + suffix + pynutil.insert('\"')), 0, 1
+            )
         )

         graph = (
@@ -134,7 +135,5 @@ def __init__(self, cardinal: GraphFst, deterministic: bool = True, lm: bool = Fa
             | pynutil.add_weight(graph_decimal_money, -1.0)
         )

-        final_graph = graph
-
-        final_graph = self.add_tokens(final_graph)
+        final_graph = self.add_tokens(graph)
         self.fst = final_graph.optimize()
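The money.py change above replaces `pynini.closure(x)` (zero or more repetitions) with `pynini.closure(x, 0, 1)` (at most one) around the quantity suffix. The following is only a rough analogy using Python's `re` module rather than pynini itself; the `DECIMAL` and `SUFFIX` patterns, the `元` currency marker, and the `accepts` helper are illustrative assumptions, not the grammar's actual data.

```python
import re

# Illustrative stand-ins (assumptions): a decimal number and two of the
# quantity suffixes listed in the diff. These are not the real FSTs.
DECIMAL = r"\d+\.\d+"
SUFFIX = r"(?:十万|万)"

# Analogous to pynini.closure(x): the suffix may appear any number of times.
unbounded = re.compile(rf"{DECIMAL}{SUFFIX}*元")
# Analogous to pynini.closure(x, 0, 1): the suffix is optional, at most once.
optional = re.compile(rf"{DECIMAL}{SUFFIX}?元")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(accepts(unbounded, "1.5万万元"))  # True: the suffix may repeat
print(accepts(optional, "1.5万万元"))   # False: at most one suffix allowed
```

Under this reading, the bounded form still accepts inputs with one suffix or none, and only rejects repeated suffixes.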
@@ -82,4 +82,4 @@ testITNWord() {
 shift $#

 # Load shUnit2
-. /workspace/shunit2/shunit2
+. /workspace/shunit2/shunit2
@@ -82,4 +82,4 @@ testITNWord() {
 shift $#

 # Load shUnit2
-. /workspace/shunit2/shunit2
+. /workspace/shunit2/shunit2
@@ -119,4 +119,4 @@ testTNMath() {
 shift $#

 # Load shUnit2
-. /workspace/shunit2/shunit2
+. /workspace/shunit2/shunit2
4 changes: 3 additions & 1 deletion tests/nemo_text_processing/mr/test_cardinal.py
@@ -16,11 +16,13 @@
 from parameterized import parameterized

 from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
+from nemo_text_processing.text_normalization.normalize import Normalizer
Check notice (Code scanning / CodeQL): Unused import. Import of 'Normalizer' is not used.

 from ..utils import CACHE_DIR, parse_test_case_file


-class TestCardinal:
+class TestPreprocess:
+
     inverse_normalizer_mr = InverseNormalizer(lang='mr', cache_dir=CACHE_DIR, overwrite_cache=False)

     @parameterized.expand(parse_test_case_file('mr/data_inverse_text_normalization/test_cases_cardinal.txt'))
1 change: 1 addition & 0 deletions tests/nemo_text_processing/mr/test_date.py
@@ -16,6 +16,7 @@
 from parameterized import parameterized

 from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
+from nemo_text_processing.text_normalization.normalize import Normalizer
Check notice (Code scanning / CodeQL): Unused import. Import of 'Normalizer' is not used.

 from ..utils import CACHE_DIR, parse_test_case_file

@@ -4,4 +4,22 @@
 只有智商超过一定数值的人才能破解~只有智商超过一定数值的人才能破解
 这是由人工智能控制的系统~这是由人工智能控制的系统
 欧洲旅游目的地多到不知道怎么选~欧洲旅游目的地多到不知道怎么选
-马斯科卖掉豪宅住进折叠屋~马斯科卖掉豪宅住进折叠屋
+马斯科卖掉豪宅住进折叠屋~马斯科卖掉豪宅住进折叠屋
+免除GOOGLE在一桩诽谤官司中的法律责任。~免除GOOGLE在一桩诽谤官司中的法律责任。
+这对CHROME是有利的。~这对CHROME是有利的。
+这可能是PILde使用者。~这可能是PILde使用者。
+CSI侧重科学办案,也就是现场搜正和鉴识。~CSI侧重科学办案,也就是现场搜正和鉴识。
+我以前非常喜欢一个软体,DRAW。~我以前非常喜欢一个软体,DRAW。
+我爱你病毒。~我爱你病毒。
+微软举办了RACETOMARKETCHALLENGE竞赛。~微软举办了RACETOMARKETCHALLENGE竞赛。
+苹果销售量的复苏程度远超PC市场。~苹果销售量的复苏程度远超PC市场。
+第三季还有两款ANDROID手机亮相。~第三季还有两款ANDROID手机亮相。
+反而应试著让所有GOOGLE服务更加社交化。~反而应试著让所有GOOGLE服务更加社交化。
+GOOGLE已提供一项NATIVECLIENT软体。~GOOGLE已提供一项NATIVECLIENT软体。
+这些程式都支援PRE与ITUNES同步化。~这些程式都支援PRE与ITUNES同步化。
+可以推断此次NTT可能也会将同样的策略用在LTE上。~可以推断此次NTT可能也会将同样的策略用在LTE上。
+现今许多小型企业因成本考量被迫采用一般PC作为伺服器。~现今许多小型企业因成本考量被迫采用一般PC作为伺服器。
+部落格宣布GOOGLECHROMES的诞生。~部落格宣布GOOGLECHROMES的诞生。
+由ZIP订购机场接送或观光景点共乘服务。~由ZIP订购机场接送或观光景点共乘服务。
+PAQUE表示短时间应该还不会全面开放。~PAQUE表示短时间应该还不会全面开放。
+CBS是美国一家重要的广播电视网路公司。~CBS是美国一家重要的广播电视网路公司。
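Each test case above is a single line of the form `input~expected`. As a minimal sketch of how such lines could be read, the snippet below uses a hypothetical `parse_pairs` helper; it is an assumption for illustration and is not the repository's `parse_test_case_file`, whose exact behavior is not shown in this diff.

```python
from typing import List, Tuple


def parse_pairs(lines: List[str], sep: str = "~") -> List[Tuple[str, str]]:
    """Split each non-empty line into (input, expected) at the first separator."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        source, expected = line.split(sep, 1)
        pairs.append((source, expected))
    return pairs


# Two of the cases added in the hunk above.
sample = [
    "我爱你病毒。~我爱你病毒。",
    "这对CHROME是有利的。~这对CHROME是有利的。",
]
pairs = parse_pairs(sample)
```

In the cases added here, both sides of the separator are identical, so the tests appear to check that plain words and uppercase Latin tokens pass through unchanged.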
@@ -1,7 +1,7 @@
 #! /bin/sh

 GRAMMARS_DIR=${1:-"/workspace/sparrowhawk/documentation/grammars"}
-PROJECT_DIR=${2:-"/workspace/tests/en"}
+PROJECT_DIR=${2:-"/workspace/tests"}

 runtest () {
 input=$1
1 change: 0 additions & 1 deletion tools/text_processing_deployment/export_grammars.sh
@@ -107,4 +107,3 @@ else
 echo "done mode: $MODE"
 exit 0
 fi
-
4 changes: 4 additions & 0 deletions tools/text_processing_deployment/pynini_export.py
@@ -266,6 +266,10 @@ def parse_args():
         from nemo_text_processing.inverse_text_normalization.ja.verbalizers.verbalize import (
             VerbalizeFst as ITNVerbalizeFst,
         )
+        from nemo_text_processing.text_normalization.hy.taggers.tokenize_and_classify import (
+            ClassifyFst as TNClassifyFst,
+        )
+        from nemo_text_processing.text_normalization.hy.verbalizers.verbalize import VerbalizeFst as TNVerbalizeFst
     output_dir = os.path.join(args.output_dir, f"{args.language}_{args.grammars}_{args.input_case}")
     export_grammars(
         output_dir=output_dir,
2 changes: 1 addition & 1 deletion tools/text_processing_deployment/sh_test.sh
@@ -63,4 +63,4 @@ VERBALIZE_FAR=${CACHE_DIR}_${GRAMMARS}_${INPUT_CASE}/verbalize/verbalize.far
 CONFIG=${LANGUAGE}_${GRAMMARS}_${INPUT_CASE}

 cp $CLASSIFY_FAR /workspace/sparrowhawk/documentation/grammars_${CONFIG}/en_toy/classify/
-cp $VERBALIZE_FAR /workspace/sparrowhawk/documentation/grammars_${CONFIG}/en_toy/verbalize/
+cp $VERBALIZE_FAR /workspace/sparrowhawk/documentation/grammars_${CONFIG}/en_toy/verbalize/