
[TTS][ja-JP] g2p and tokenizer.#9879

Merged
XuesongYang merged 6 commits into main from jpg2p_jun18
Jul 26, 2024

XuesongYang (Collaborator) commented Jul 25, 2024

Previously, #9538 was merged, but #9874 recently reverted it from main because it blocked the container build. The root cause was mostly the addition of cutlet to requirements/requirements_tts.txt, which broke the build. While fixing it, I also found that the unit test did not cover the case where the janome word segmenter is applied (this segmenter should be used by default).

Resubmitting this PR after fixing the failed CI/CD tasks and the unit test bugs.
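One common way to keep an optional segmenter from becoming a hard install requirement is to import it lazily and fail with a clear message only when it is actually requested. The sketch below is illustrative and is not the code in this PR: `load_optional`, `get_segmenter`, and the `char` fallback are hypothetical names. Only `janome.tokenizer.Tokenizer` and its `tokenize(text, wakati=True)` call come from the janome library.

```python
import importlib
from typing import Callable, List


def load_optional(module_name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def char_segmenter(text: str) -> List[str]:
    """Hypothetical fallback: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def get_segmenter(name: str = "janome") -> Callable[[str], List[str]]:
    """Build a word segmenter by name; janome is imported only on demand."""
    if name == "janome":
        janome_tokenizer = load_optional("janome.tokenizer")
        if janome_tokenizer is None:
            raise ImportError(
                "janome is not installed; install it or request name='char'."
            )
        tokenizer = janome_tokenizer.Tokenizer()
        # wakati=True makes janome yield plain surface strings.
        return lambda text: list(tokenizer.tokenize(text, wakati=True))
    if name == "char":
        return char_segmenter
    raise ValueError(f"unknown segmenter: {name!r}")
```

With a factory like this, a unit test can be parametrized over segmenter names so that the default path is exercised explicitly rather than only the fallback.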

  • adding japanese text preprocessing
  • japanese phoneme tokenizer
  • japanese tests
  • japanese g2p model
  • japanese word to ipa dictionary
  • add requirements
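As a rough picture of how the pieces listed above could fit together, here is a minimal sketch of a dictionary-based G2P step followed by a phoneme tokenizer. Everything in it is an assumption made for illustration: `TOY_DICT`, `toy_g2p`, `ToyPhonemeTokenizer`, and the `#` prefix are hypothetical, and the dictionary entries are toy values rather than entries from the PR's word-to-IPA dictionary. A later commit in this PR mentions adding a prefix for ASCII letters; the sketch reads that as a way to keep input letters distinct from phoneme symbols, which is a guess.

```python
from typing import Dict, List

# Toy word-to-IPA entries for illustration only.
TOY_DICT: Dict[str, List[str]] = {
    "ねこ": ["n", "e", "k", "o"],
    "いぬ": ["i", "n", "ɯ"],
}

ASCII_PREFIX = "#"  # hypothetical marker for ASCII letters in the input


def toy_g2p(words: List[str], word_to_ipa: Dict[str, List[str]]) -> List[str]:
    """Map pre-segmented words to phoneme symbols via dictionary lookup."""
    phonemes: List[str] = []
    for word in words:
        if word in word_to_ipa:
            phonemes.extend(word_to_ipa[word])
            continue
        # Unknown word: pass characters through, prefixing ASCII letters
        # so that input "n" cannot be confused with the phoneme "n".
        for ch in word:
            is_ascii_letter = ch.isascii() and ch.isalpha()
            phonemes.append(ASCII_PREFIX + ch if is_ascii_letter else ch)
    return phonemes


class ToyPhonemeTokenizer:
    """Assign integer ids to phoneme symbols, with pad and OOV entries."""

    def __init__(self, symbols: List[str]) -> None:
        self.pad, self.oov = "<pad>", "<oov>"
        vocab = [self.pad, self.oov] + sorted(set(symbols))
        self.token2id = {tok: i for i, tok in enumerate(vocab)}
        self.id2token = {i: tok for tok, i in self.token2id.items()}

    def encode(self, phonemes: List[str]) -> List[int]:
        oov_id = self.token2id[self.oov]
        return [self.token2id.get(p, oov_id) for p in phonemes]

    def decode(self, ids: List[int]) -> List[str]:
        return [self.id2token[i] for i in ids]


phonemes = toy_g2p(["ねこ", "ab", "いぬ"], TOY_DICT)
tokenizer = ToyPhonemeTokenizer(phonemes)
ids = tokenizer.encode(phonemes)
```

The G2P step expects text that has already been split into words, which is where a word segmenter such as janome would sit in a real pipeline.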


XuesongYang (Collaborator, Author) commented:

Adding @BuyuanCui for visibility.

BuyuanCui and others added 5 commits July 26, 2024 02:41
* japanese phoneme tokenizer
* japanese tests
* japanese g2p model
* japanese word to ipa dictionary
* add requirements

Signed-off-by: Alex Cui <alexcui1994@gmail.com>

---------

Signed-off-by: Alex Cui <alexcui1994@gmail.com>
Signed-off-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Co-authored-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
XuesongYang merged commit 491b577 into main Jul 26, 2024
XuesongYang deleted the jpg2p_jun18 branch July 26, 2024 20:22
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jul 30, 2024
* adding japanese text preprocessing
* japanese phoneme tokenizer
* japanese tests
* japanese g2p model
* japanese word to ipa dictionary
* add requirements

Signed-off-by: Alex Cui <alexcui1994@gmail.com>

---------

Signed-off-by: Alex Cui <alexcui1994@gmail.com>
Signed-off-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Co-authored-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* bugfix: add prefix for ascii letters and fix unit test.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* bugfix: remove tone_list.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>

* fix: unit test

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

* fix: wrong format

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

---------

Signed-off-by: Alex Cui <alexcui1994@gmail.com>
Signed-off-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: XuesongYang <XuesongYang@users.noreply.github.com>
Co-authored-by: Alex Cui <alexcui1994@gmail.com>
Co-authored-by: BuyuanCui <BuyuanCui@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: XuesongYang <XuesongYang@users.noreply.github.com>
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
xuanzic pushed a commit to xuanzic/NeMo that referenced this pull request Aug 1, 2024
Signed-off-by: Vivian Chen <xuanzic@example.com>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
XuesongYang added a commit to paarthneekhara/NeMo that referenced this pull request Jan 18, 2025