Merged
99 commits
f3ac5ca
Add ZH ITN
alexcui-nvidia Feb 8, 2023
1b34ceb
Fix copyrights and code cleanup
anand-nv Feb 9, 2023
9b61ce8
Remove invalid tests
anand-nv Feb 9, 2023
cbb4379
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2023
60f5f8c
Resolve CodeQL issues
anand-nv Feb 9, 2023
b646ce7
Cleanup
anand-nv Feb 9, 2023
f2366f2
Fix missing 'zh' option for ITN and correct comment
anand-nv Feb 9, 2023
0e24e43
Update __init__.py
BuyuanCui Mar 1, 2023
4b5ae7c
Merge branch 'main' into zh_itn
BuyuanCui Mar 9, 2023
64f37c0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 9, 2023
a9d3ec4
update for decimal test data
BuyuanCui Mar 9, 2023
04f1aee
update for language import
BuyuanCui Mar 14, 2023
cbeeba0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2023
4b0cad3
update for Chinese punctuations
BuyuanCui Mar 14, 2023
ba8e110
a new class for whitelist
BuyuanCui Mar 14, 2023
992a644
PYNINI_AVAILABLE = False
BuyuanCui Mar 27, 2023
b8134fb
recreated due to file import format issue
BuyuanCui May 25, 2023
f2bd6d2
recreated due to format issue
BuyuanCui May 25, 2023
fc17b3a
caught duplicates, removed
BuyuanCui May 25, 2023
fe52b29
removed duplicates, arranges for Chinese Yuan updates
BuyuanCui May 25, 2023
63ee92a
updates accordingly to the comments from last PR. Recreated some of t…
BuyuanCui May 25, 2023
1481d2c
removed the hours_to and minute_to files used for back counting. Also…
BuyuanCui May 25, 2023
d40a499
re-added this file to avoid data file import error
BuyuanCui May 25, 2023
7a822f3
updated grammar according to last PR. Removed the acceptance of 千
BuyuanCui May 25, 2023
37b7be2
updates
BuyuanCui May 25, 2023
5cf6d45
updated according to last PR. Removed comma after decimal points
BuyuanCui May 25, 2023
eb39270
grammar for Fraction
BuyuanCui May 25, 2023
4fcda3d
grammar for money and updated according to last PR. Plus process of 元
BuyuanCui May 25, 2023
60fddba
ordinal grammar. updates due to the updates in cardinal grammar
BuyuanCui May 25, 2023
7374ef5
updated accordingly to last PR comments. removing am and pm and allow…
BuyuanCui May 25, 2023
bb7f905
arrangements
BuyuanCui May 25, 2023
608e98b
added whitelist grammar
BuyuanCui May 25, 2023
a17090b
word grammar for non-classified items
BuyuanCui May 25, 2023
1d2af16
updated cardinal, decimal, time, itn data
BuyuanCui May 25, 2023
7c9866d
updates according to last PR
BuyuanCui May 25, 2023
7a5e8df
updates according to the updates for cardinal grammar
BuyuanCui May 25, 2023
d4f9585
updates for more Mandarin punctuations
BuyuanCui May 25, 2023
d4d1555
updated accordingly to last PR. removing am pm
BuyuanCui May 25, 2023
c25bada
adjustment on the weight
BuyuanCui May 25, 2023
b5c8497
updated accordingly to the tagger updates
BuyuanCui May 25, 2023
2113a7d
updated accordingly to the time tagger
BuyuanCui May 25, 2023
785cbb7
updates according to changes in tagger on am and pm
BuyuanCui May 25, 2023
ceae274
verbalizer for fraction
BuyuanCui May 25, 2023
aeae379
added for mandarin grammar
BuyuanCui May 25, 2023
5852b41
kept this file because using English utils results in data naming error
BuyuanCui May 25, 2023
d018a0c
merge conflict
BuyuanCui May 25, 2023
092743c
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into…
BuyuanCui May 25, 2023
c72c7cb
removed unused imports
BuyuanCui May 29, 2023
5a363e2
deleted unused import os
BuyuanCui May 29, 2023
8a8b1df
deleted unused variables
BuyuanCui May 29, 2023
434f041
removed unused imports
BuyuanCui May 29, 2023
5278e98
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 29, 2023
40b6bc9
updates and edits based on pr checks
BuyuanCui May 31, 2023
03fb6f0
updates and edits based on pr checks
BuyuanCui May 31, 2023
91fa0d4
format issue, recreated
BuyuanCui May 31, 2023
130b351
format issue recreated
BuyuanCui May 31, 2023
5b77573
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui May 31, 2023
ae1f3a8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2023
dde4136
fixed coding style/format
BuyuanCui May 31, 2023
07d7e94
fixed coding style and format
BuyuanCui May 31, 2023
16c4b8f
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui May 31, 2023
6759721
removed duplicated graph for 毛
BuyuanCui Jun 7, 2023
bb27669
Merge branch 'main' into zh_itn
BuyuanCui Jun 14, 2023
60fd16c
removed the comment
BuyuanCui Jun 27, 2023
a4bc7cc
removed the comment
BuyuanCui Jun 27, 2023
bea168e
removing unnecessary comments
BuyuanCui Jun 27, 2023
d4905ce
unnecessary comment removed
BuyuanCui Jun 27, 2023
92cbc07
test file updated for more cases
BuyuanCui Jun 27, 2023
dca3168
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui Jun 27, 2023
12fb036
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 27, 2023
545a54a
updated with a comment explaining why this file is kept
BuyuanCui Jun 27, 2023
d058421
updated the file explaining why this file is kept
BuyuanCui Jun 27, 2023
73dff6f
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui Jun 27, 2023
be67818
added Mandarin as zh
BuyuanCui Jun 27, 2023
476fa61
removing for duplication
BuyuanCui Jun 27, 2023
06768d2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 27, 2023
d5c4025
removed unused NEMO objects
BuyuanCui Jun 27, 2023
05f70e7
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui Jun 28, 2023
cbf6ffc
removed duplicates
BuyuanCui Jun 28, 2023
2cd9af4
removing unused imports
BuyuanCui Jun 28, 2023
cb7fb16
updates to fix test file failures
BuyuanCui Jun 29, 2023
7425d89
updates to fix file failures
BuyuanCui Jun 29, 2023
ee19a6a
updates to resolve test case failure
BuyuanCui Jun 29, 2023
34c5702
updates to resolve test case failure
BuyuanCui Jun 29, 2023
1883240
updates to resolve test case failure
BuyuanCui Jun 29, 2023
b05356f
updates to resolve test case failure
BuyuanCui Jun 29, 2023
9d53722
updates to adapt to cardinal grammar changes
BuyuanCui Jun 29, 2023
bf4868a
updates to adapt to grammar changes
BuyuanCui Jun 29, 2023
a8b7e72
updates to adapt to cardinal grammar changes
BuyuanCui Jun 29, 2023
5f58f52
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 29, 2023
b8ed959
fix style
BuyuanCui Jun 29, 2023
abdb582
fix style
BuyuanCui Jun 29, 2023
ce7919b
fix style
BuyuanCui Jun 29, 2023
4078618
fix style
BuyuanCui Jun 29, 2023
d5da2d4
Merge branch 'zh_itn' of github.com:NVIDIA/NeMo-text-processing into …
BuyuanCui Jun 29, 2023
3af3141
fixing pr checks
BuyuanCui Jun 29, 2023
f9c6d15
removed // for zhtn/itn cache
BuyuanCui Jun 30, 2023
820b80d
Update inverse_normalize.py
BuyuanCui Jun 30, 2023
fd6fdcc
Merge branch 'main' into zh_itn
BuyuanCui Jun 30, 2023
12 changes: 6 additions & 6 deletions Jenkinsfile
@@ -22,7 +22,7 @@ pipeline {
RU_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
VI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
SV_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
-ZH_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
+ZH_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-29-23-0'
DEFAULT_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'

}
@@ -319,11 +319,11 @@ pipeline {
sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize.py --lang=zh --text="你" --cache_dir ${ZH_TN_CACHE}'
}
}
-// stage('L0: ZH ITN grammars') {
-// steps {
-// sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=zh --text="二零零二年一月二十八日 " --cache_dir ${ZH_TN_CACHE}'
-// }
-// }
+stage('L0: ZH ITN grammars') {
+steps {
+sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=zh --text="二零零二年一月二十八日 " --cache_dir ${ZH_TN_CACHE}'
+}
+}
}
}

@@ -106,6 +106,11 @@ def __init__(
from nemo_text_processing.inverse_text_normalization.es_en.verbalizers.verbalize_final import (
VerbalizeFinalFst,
)
+elif lang == 'zh': # Mandarin
+from nemo_text_processing.inverse_text_normalization.zh.taggers.tokenize_and_classify import ClassifyFst
+from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize_final import (
+VerbalizeFinalFst,
+)

self.tagger = ClassifyFst(
cache_dir=cache_dir, whitelist=whitelist, overwrite_cache=overwrite_cache, input_case=input_case
@@ -150,7 +155,7 @@ def parse_args():
parser.add_argument(
"--language",
help="language",
-choices=['en', 'de', 'es', 'pt', 'ru', 'fr', 'sv', 'vi', 'ar', 'es_en'],
+choices=['en', 'de', 'es', 'pt', 'ru', 'fr', 'vi', 'ar', 'es_en', 'zh'],
default="en",
type=str,
)
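The hunk above restricts `--language` to a fixed list of codes and adds 'zh' to it. As a rough, stand-alone illustration of how argparse enforces such a `choices` list, here is a minimal sketch. The names `SUPPORTED_LANGUAGES`, `build_parser` and `parse_language` are hypothetical and not part of the PR; only the `--language` option and its new list of choices are taken from the diff.

```python
import argparse

# Language codes copied from the updated `choices` list in the hunk above.
SUPPORTED_LANGUAGES = ['en', 'de', 'es', 'pt', 'ru', 'fr', 'vi', 'ar', 'es_en', 'zh']


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-alone parser that mirrors only the --language option."""
    parser = argparse.ArgumentParser(description="Sketch of the language option")
    parser.add_argument(
        "--language",
        help="language",
        choices=SUPPORTED_LANGUAGES,
        default="en",
        type=str,
    )
    return parser


def parse_language(argv):
    """Return the selected language code, or None if argparse rejects it."""
    try:
        return build_parser().parse_args(argv).language
    except SystemExit:
        # argparse exits with an error message when the value is not in `choices`.
        return None
```

In this sketch, `parse_language(["--lang=zh"])` is also accepted, but only because argparse allows unambiguous prefixes of long options by default. Whether the real script defines `--lang` explicitly is not visible in the hunk.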
17 changes: 17 additions & 0 deletions nemo_text_processing/inverse_text_normalization/zh/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_text_processing.inverse_text_normalization.zh.taggers.tokenize_and_classify import ClassifyFst
from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize import VerbalizeFst
from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize_final import VerbalizeFinalFst
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,74 @@
一 1
二 2
三 3
四 4
五 5
六 6
七 7
八 8
九 9
十 10
十一 11
十二 12
十三 13
十四 14
十五 15
十六 16
十七 17
十八 18
十九 19
二十 20
二十一 21
二十二 22
二十三 23
二十四 24
二十五 25
二十六 26
二十七 27
二十八 28
二十九 29
三十 30
三十一 31
壹 1
貳 2
參 3
肆 4
伍 5
陸 6
柒 7
捌 8
玖 9
幺 1
两 2
兩 2
拾 10
拾壹 11
拾貳 12
拾叁 13
拾肆 14
拾伍 15
拾陸 16
拾柒 17
拾捌 18
拾玖 19
貳拾 20
貳拾壹 21
貳拾貳 22
貳拾叁 23
貳拾肆 24
貳拾伍 25
貳拾陸 26
貳拾柒 27
貳拾捌 28
貳拾玖 29
叁拾 30
叁拾壹 31
壹 1
拾壹 11
贰拾壹 21
贰 2
陆 6
拾贰 12
拾陆 16
贰拾贰 22
贰拾陆 26
@@ -0,0 +1,49 @@
一 1
二 2
三 3
四 4
五 5
六 6
七 7
八 8
九 9
十 10
十一 11
十二 12
一十 10
零一 1
零二 2
零三 3
零四 4
零五 5
零六 6
零七 7
零八 8
零九 9
壹 1
贰 2
叁 3
肆 4
伍 5
陆 6
柒 7
捌 8
玖 9
拾 10
拾壹 11
拾贰 12
壹拾 10
零壹 1
零贰 2
零叁 3
零肆 4
零伍 5
零陆 6
零柒 7
零捌 8
零玖 9
貳 2
零貳 2
陸 6
零陸 6
拾貳 12
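The two tables above pair a spoken form (day or month numerals, in standard, financial, simplified and traditional variants) with its written number. A minimal sketch of reading such a two-column table into a lookup follows. It assumes the real files are tab-separated, which the flattened diff does not show, and `MONTH_TSV`, `load_mapping` and `lookup_month` are hypothetical names. The PR's grammars themselves are not reproduced here.

```python
import csv
import io

# A few rows copied from the second table above; the tab separator is an assumption.
MONTH_TSV = "一\t1\n十\t10\n十二\t12\n零九\t9\n拾贰\t12\n"


def load_mapping(tsv_text):
    """Parse two-column, tab-separated text into a spoken -> written dict."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) == 2}


def lookup_month(spoken, table):
    """Return the written month number, or None when the form is not listed."""
    return table.get(spoken)
```

Listing every accepted spelling as its own row keeps the lookup trivial: variants such as 十二 and 拾贰 simply map to the same value.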
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,75 @@
美元 US$
欧元 €
歐元 €
英镑 £
英鎊 £
加拿大元 CAD$
加拿大币 CAD$
加拿大幣 CAD$
加元 CAD$
加币 CAD$
加幣 CAD$
瑞士法郎 Fr
法郎 ₣
圆 ¥
圓 ¥
瑞典克朗 Kr
墨西哥比索 MXN$
新西兰元 NZD$
新西蘭元 NZD$
新加坡币 SGD$
新加坡幣 SGD$
新加坡元 SGD$
港元 HKD$
港币 HKD$
港幣 HKD$
挪威克朗 NOKkr
韩元 ₩
韓元 ₩
韩币 ₩
韓幣 ₩
土耳其里拉 TRY₺
印度卢布 ₹
印度盧布 ₹
印度卢比 ₹
印度盧比 ₹
俄罗斯卢布 ₽
俄羅斯盧布 ₽
俄罗斯卢比 ₽
俄羅斯盧比 ₽
巴西雷亚尔 BRLR$
巴西雷亞爾 BRLR$
南非兰特 R
南非蘭特 R
丹麦克朗 DKKkr
丹麥克朗 DKKkr
波兰兹罗提 zł
波蘭兹儸提 zł
新台币 TWDNT$
新臺幣 TWDNT$
泰铢 ฿
泰銖 ฿
马来西亚林吉特 RM
馬來西亞林吉特 RM
印尼盾 Rp
匈牙利福林 Ft
捷克克朗 Kč
以色列新谢克尔 ₪
以色列新謝克爾 ₪
智利披索 CLP$
菲律宾披索 ₱
菲律賓披索 ₱
阿联酋迪拉姆 د.إ
阿聯酋迪拉姆 د.إ
哥伦比亚披索 COL$
哥倫比亞披索 COL$
马来西亚令吉 RM
馬來西亞令吉 RM
罗马尼亚列伊 L
羅馬尼亞列伊 L
日元 JPY¥
日圆 JPY¥
日圓 JPY¥
人民币 ¥
人民幣 ¥
元 ¥
@@ -0,0 +1,9 @@
美分 US$
欧分 €
便士 £
加拿大分 CAD$
生丁 ₣
瑞典欧尔 KrOre
分 MXN$
新西兰仙 NZD$
挪威欧尔 NOKOre
@@ -0,0 +1 @@
分 ¥
@@ -0,0 +1,2 @@
毛 ¥
角 ¥
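Several currency names in the tables above end in the bare 元 entry (for example 美元 and 加拿大元), so any lookup has to prefer the longest matching name. The sketch below illustrates that idea in plain Python under stated assumptions: `CURRENCY_MAJOR` holds only a handful of rows copied from the major-currency table, and `split_currency` is a hypothetical helper, not the mechanism the PR uses to resolve the overlap.

```python
# A few rows copied from the major-currency table above.
CURRENCY_MAJOR = {
    "美元": "US$",
    "加拿大元": "CAD$",
    "加元": "CAD$",
    "欧元": "€",
    "元": "¥",
}


def split_currency(text, table=CURRENCY_MAJOR):
    """Split a trailing currency name off `text` and return (amount, symbol)."""
    # Try longer names first so that "美元" wins over its suffix "元".
    for name in sorted(table, key=len, reverse=True):
        if text.endswith(name):
            return text[: -len(name)], table[name]
    # No known currency name at the end of the text.
    return text, None
```

For instance, `split_currency("五美元")` yields the amount 五 with the symbol US$, while `split_currency("五元")` falls through to the bare 元 entry.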
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,22 @@
一 1
二 2
三 3
四 4
五 5
六 6
七 7
八 8
九 9
壹 1
贰 2
叁 3
肆 4
伍 5
陆 6
柒 7
捌 8
玖 9
貳 2
陸 6
两 2
兩 2
@@ -0,0 +1,18 @@
二十 2
三十 3
四十 4
五十 5
六十 6
七十 7
八十 8
九十 9
贰拾 2
叁拾 3
肆拾 4
伍拾 5
陆拾 6
柒拾 7
捌拾 8
玖拾 9
貳拾 2
陸拾 6
@@ -0,0 +1 @@
零 0
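The three small tables above (single digits, tens words, and zero) each map a spoken form to one written digit, which suggests that two-digit numbers are assembled from a tens digit followed by a units digit. The sketch below shows that composition in plain Python. It is an illustration only: `DIGIT`, `TIES`, `ZERO` and `to_two_digits` are hypothetical names, the dictionaries hold just a few of the rows above, and teens such as 十一 are deliberately not handled because none of these three tables lists 十.

```python
# A few rows copied from the three tables above.
DIGIT = {"一": "1", "二": "2", "三": "3", "九": "9", "两": "2"}
TIES = {"二十": "2", "三十": "3", "九十": "9"}
ZERO = {"零": "0"}


def to_two_digits(spoken):
    """Convert a spoken number up to two digits, or return None if unsupported."""
    if spoken in ZERO:
        return ZERO[spoken]
    if spoken in DIGIT:
        return DIGIT[spoken]
    for ties_word, tens in TIES.items():
        if spoken.startswith(ties_word):
            rest = spoken[len(ties_word):]
            if rest == "":
                # A bare tens word such as 二十 gets a trailing zero.
                return tens + "0"
            if rest in DIGIT:
                # Tens digit followed by the units digit, e.g. 二十一 -> 21.
                return tens + DIGIT[rest]
    return None
```

Storing only the tens digit for 二十, 三十 and so on lets the same table serve both round numbers and compounds.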
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.