Merged
Commits
160 commits
743a9a5
Merge r1.13.0 main (#5570)
ericharper Dec 8, 2022
e7c82f5
Optimized loop and bugfix in SDE (#5573)
Jorjeous Dec 8, 2022
0ce4428
Update torchmetrics (#5566)
nithinraok Dec 8, 2022
b0fe80c
remove useless files. (#5580)
XuesongYang Dec 9, 2022
b76dc7e
add initial NFA code
erastorgueva-nv Dec 8, 2022
d397c22
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 8, 2022
c39bd56
Make use of the specified device during viterbi decoding
erastorgueva-nv Dec 9, 2022
e4708a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 9, 2022
b5a200c
Fix CodeQL notes
erastorgueva-nv Dec 9, 2022
0ec1093
Fix CodeQL warning
erastorgueva-nv Dec 9, 2022
53b7467
Add an option to defer data setup from ``__init__`` to ``setup`` (#5569)
anteju Dec 9, 2022
303d7f5
Make utt_id specified by number of parts of audio_filepath user wishe…
erastorgueva-nv Dec 9, 2022
b8398f5
remove audio_sr TODO - reduce risk of silent bugs
erastorgueva-nv Dec 9, 2022
db77d8f
Add check that model is CTC
erastorgueva-nv Dec 9, 2022
19cdbe4
Remove unused import
erastorgueva-nv Dec 9, 2022
5b8aacc
Text generation improvement (UI client, data parallel support) (#5437)
yidong72 Dec 9, 2022
f02c88f
Add align function docstrings and make most args optional
erastorgueva-nv Dec 9, 2022
698305b
Remove redundant returns of viterbi and log probs matrices
erastorgueva-nv Dec 10, 2022
4fa2fd1
Rename h# to <initial_silence>
erastorgueva-nv Dec 10, 2022
7aad560
Update manifest format description in README
erastorgueva-nv Dec 10, 2022
e42d121
always remove any spaces from utt_id
erastorgueva-nv Dec 10, 2022
7d7e529
Patch the hanging of threads on very large stderr (#5589) (#5590)
github-actions[bot] Dec 10, 2022
d4aae2a
O2 style amp for gpt3 ptuning (#5246)
JimmyZhang12 Dec 10, 2022
634cd2d
Better patch hydra (#5591) (#5592)
github-actions[bot] Dec 10, 2022
c32fdaa
Yet another fix with hydra multirun (#5594) (#5595)
github-actions[bot] Dec 10, 2022
9adfbf8
Add RETRO model documentation (#5578)
yidong72 Dec 12, 2022
8ae60f5
Fix: setup_multiple validation/test data (#5585)
anteju Dec 12, 2022
99dc03b
Move to optimizer based EMA implementation (#5169)
SeanNaren Dec 12, 2022
1250c36
AIStore for ASR datasets (#5462)
anteju Dec 12, 2022
d4f3cca
Add support for MHA adapters to ASR (#5396)
titu1994 Dec 12, 2022
199ff16
Remove unused TTS eval functions w/ pesq and pystoi dependencies (#56…
github-actions[bot] Dec 13, 2022
675ec6b
Create separator parameter
erastorgueva-nv Dec 13, 2022
ad4fd5c
Call align function with hydra config
erastorgueva-nv Dec 13, 2022
191695a
update usage example
erastorgueva-nv Dec 13, 2022
9e84910
Update Dockerfile (#5614) (#5616)
github-actions[bot] Dec 13, 2022
3a6adb5
Make separate pretrained_name and model_path parameters
erastorgueva-nv Dec 13, 2022
d979234
make "optional" tags bold in markdown
erastorgueva-nv Dec 13, 2022
9c6c6db
Move non-main functions to utils dir
erastorgueva-nv Dec 13, 2022
67093df
Temp workaround: Disable test with cache_audio=True since it is faili…
anteju Dec 13, 2022
f4cfa47
[TTS] fix ranges of char set for accented letters. (#5607)
XuesongYang Dec 14, 2022
1d38520
Change success message to reduce confusion (#5621)
SeanNaren Dec 14, 2022
67dffe4
Update documentation and tutorials for Adapters (#5610)
titu1994 Dec 14, 2022
00d8ec8
[TTS] add type hints and change varialbe names for tokenizers and g2p…
XuesongYang Dec 14, 2022
676a5ea
1. Added missing import for gather_objects. (#5627)
michalivne Dec 14, 2022
78488a7
[TTS][ZH] add fastpitch and hifigan model NGC urls and update NeMo do…
github-actions[bot] Dec 15, 2022
9b59d16
Fixed RadTTS unit test (#5572)
borisfom Dec 15, 2022
9c50507
remove tests (#5633)
ericharper Dec 15, 2022
8215569
[TTS][DOC] add notes about automatic conversion to target sampling ra…
github-actions[bot] Dec 15, 2022
f306269
Conformer local attention (#5525)
sam1373 Dec 15, 2022
d7e84cb
Add core classes and functions for online clustering diarizer part 1 …
tango4j Dec 15, 2022
9a5e5f8
[STT] Add Esperanto (Eo) ASR Conformer-CTC and Conformer-Transducer m…
github-actions[bot] Dec 15, 2022
eea669e
Removed unused import
erastorgueva-nv Dec 15, 2022
9dd2b11
Specify that filepaths need to be absolute
erastorgueva-nv Dec 15, 2022
9952524
replaces any spaces in utt_id with dashes
erastorgueva-nv Dec 15, 2022
7b5fcce
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 15, 2022
8f4ecea
Make hydra script callable by another script
erastorgueva-nv Dec 15, 2022
2703734
do not specify default model or model_downsample_factor
erastorgueva-nv Dec 15, 2022
6dbc849
[Dockerfile] Remove AIS archive from docker image (#5629)
anteju Dec 15, 2022
941f85e
Measure audio_sr from audio instead of needing to specify
erastorgueva-nv Dec 15, 2022
44a856d
[TTS][ZH] Disambiguate polyphones with augmented dict and Jieba segme…
yuekaizhang Dec 15, 2022
b635d4e
Make separate parameters for device of transcription and viterbi steps
erastorgueva-nv Dec 15, 2022
5d30692
Add mention of gecko
erastorgueva-nv Dec 15, 2022
db480d1
[workflow] add exclude labels option to ignore cherry-picks in releas…
XuesongYang Dec 16, 2022
189a44d
[TTS][ZH] bugfix for the tutorial and add NGC CLI installation guide.…
github-actions[bot] Dec 16, 2022
2733d54
[Add] ASR+VAD Inference Pipeline (#5575)
stevehuang52 Dec 16, 2022
40323ae
rename separator to ctm_grouping_separator and refactor
erastorgueva-nv Dec 16, 2022
77300e9
Bert interleaved (#5556)
shanmugamr1992 Dec 16, 2022
dbefd26
Add duration padding support for RADTTS inference (#5650)
kevjshih Dec 16, 2022
c5ed085
Add remove_blank_tokens_from_ctm parameter
erastorgueva-nv Dec 16, 2022
c3899a0
Dont save initial_silence line in CTM
erastorgueva-nv Dec 16, 2022
0087dad
Add DLLogger support to exp_manager (#5658)
milesial Dec 17, 2022
c56e190
add minimum_timestamp_duration parameter
erastorgueva-nv Dec 17, 2022
312e440
add suggestion about removing blanks to README
erastorgueva-nv Dec 17, 2022
73f796c
reorder args
erastorgueva-nv Dec 17, 2022
612ac2e
clarify description of ctm_grouping_separator in README
erastorgueva-nv Dec 17, 2022
460f82b
update docstring
erastorgueva-nv Dec 17, 2022
2c6c4ef
[TTS][ZH] bugfix for ngc cli installation. (#5652) (#5664)
github-actions[bot] Dec 17, 2022
b6da516
Port stateless timer to exp manager (#5584)
MaximumEntropy Dec 17, 2022
0a5463b
Fix EMA restart by allowing device to be set by the class init (#5668)
SeanNaren Dec 19, 2022
c708b9d
Remove SDP (moved to separate repo) - merge to main (#5630)
erastorgueva-nv Dec 19, 2022
247dc27
Add interface for making amax reduction optional for FP8 (#5447)
ksivaman Dec 20, 2022
6b13cd5
[TTS] add tts dict cust notebook (#5662)
ekmb Dec 20, 2022
77da321
[ASR] Audio processing base, multi-channel enhancement models (#5356)
anteju Dec 20, 2022
e421078
Expose ClusteringDiarizer device (#5681)
SeanNaren Dec 20, 2022
7a1319b
Add Beam Search support to ASR transcribe() (#5443)
titu1994 Dec 20, 2022
4eead4f
Propagate attention_dropout flag for GPT-3 (#5669)
mikolajblaz Dec 20, 2022
2d21316
Enc-Dec model size reporting fixes (#5623)
MaximumEntropy Dec 21, 2022
d3c15b0
Multiblank Transducer (#5527)
hainan-xv Dec 21, 2022
b71836b
[TTS][ZH] fix broken link for the script. (#5680)
github-actions[bot] Dec 21, 2022
effc7f5
[TN/TTS docs] TN customization, g2p docs moved to tts (#5683)
ekmb Dec 21, 2022
9fa98d5
Add prompt learning tests (#5649)
arendu Dec 21, 2022
e25b5dc
remove output (#5689) (#5690)
github-actions[bot] Dec 21, 2022
8a4b45b
Minor fixes (#5691)
MaximumEntropy Dec 21, 2022
bdaf431
temp disbale speaker reco CI (#5696)
fayejf Dec 24, 2022
4e292c9
some tokenizers do not have additional_special_tokens_ids attribute (…
github-actions[bot] Dec 24, 2022
2cab473
Bump setuptools from 59.5.0 to 65.5.1 in /requirements (#5704)
dependabot[bot] Dec 27, 2022
17ac65e
Merge 1.14.0 main (#5705)
ericharper Dec 28, 2022
0bb6aa6
Don't print exp_manager warning when max_steps == -1 (#5725)
milesial Jan 3, 2023
13065f3
pin torchmetrics version (#5720)
nithinraok Jan 4, 2023
4bb9e89
Update to pytorch 22.12 container (#5694)
ericharper Jan 4, 2023
e7dd06d
add keep_initializers_as_inputs to _export method (#5731)
pks Jan 4, 2023
9a02343
added tab former doc to the index page (#5733)
yidong72 Jan 4, 2023
a91e7e3
ALiBi Positional Embeddings (#5467)
michalivne Jan 4, 2023
d67cb7c
Ensure EMA checkpoints are also deleted when normal checkpoints are (…
SeanNaren Jan 5, 2023
5e09cb4
Fix P-Tuning Truncation (#5663)
vadam5 Jan 5, 2023
23d3e30
Update 00_NeMo_Primer.ipynb (#5740)
schaltung Jan 5, 2023
90fc373
Support non-standard padding token id (#5543)
Numeri Jan 6, 2023
9f46caf
typo and link fixed (#5741) (#5744)
github-actions[bot] Jan 6, 2023
29e36bb
link fixed (#5745) (#5746)
github-actions[bot] Jan 6, 2023
e457a0d
[TTS] Update Spanish TTS model to 1.15 (#5742)
rlangman Jan 6, 2023
1a413c1
Fix for incorrect computation of batched alignment in transducers (#5…
Kipok Jan 6, 2023
33af65d
Move Attention and MLP classes to a separate file in Megatron transfo…
MaximumEntropy Jan 6, 2023
e28fcc9
Adithyare/prompt learning seed (#5749)
arendu Jan 7, 2023
de82689
Set the stream position to 0 for pydub (#5752)
jonghwanhyeon Jan 8, 2023
c0b36db
Fix: conformer encoder forward when length is None (#5761)
anteju Jan 10, 2023
4190144
Update Tacotron2 NGC checkpoint load to latest version (#5760) (#5762)
github-actions[bot] Jan 10, 2023
d369a0e
[TTS][DE] refine grapheme-based tokenizer and fastpitch training reci…
XuesongYang Jan 10, 2023
0239a57
Refactor so token, word and additonal segment-level alignments are ge…
erastorgueva-nv Jan 10, 2023
0236daa
change CTM rounding to remove unnecessary decimal figures
erastorgueva-nv Jan 10, 2023
31edb15
Move obtaining start and end of batch line IDs to separate util function
erastorgueva-nv Jan 10, 2023
6258cc3
Sanitize params before DLLogger log_hyperparams (#5736)
milesial Jan 10, 2023
71ea999
Allow to run alignment on transcribed pred_text
erastorgueva-nv Jan 10, 2023
a1642a5
Update README
erastorgueva-nv Jan 10, 2023
466ea8c
update README
erastorgueva-nv Jan 10, 2023
6c2b043
Rename output_ctm_folder to output_dir
erastorgueva-nv Jan 10, 2023
5b65827
rename n_parts_for_ctm to audio_filepath_parts_in_utt_id
erastorgueva-nv Jan 10, 2023
bb32cea
Rename some variables to improve readability
erastorgueva-nv Jan 10, 2023
cdb1876
move constants to separate file
erastorgueva-nv Jan 10, 2023
20eac8e
Add extra data args to support proper finetuning of HF converted T5 c…
MaximumEntropy Jan 10, 2023
94d4c50
Rename some functions
erastorgueva-nv Jan 10, 2023
26bf464
update year
erastorgueva-nv Jan 10, 2023
d928171
No-script TS export, prepared for ONNX export (#5653)
borisfom Jan 11, 2023
1c5586d
ASR evaluator (#5728)
fayejf Jan 11, 2023
b672a3a
Docs g2p update (#5769) (#5775)
github-actions[bot] Jan 11, 2023
7e8ba51
adding back tar script for decoder dataset for duplex (#5773)
yzhang123 Jan 11, 2023
015d36e
[ASR][Test] Enable test for cache audio with a single worker (#5763)
anteju Jan 12, 2023
e489832
Fixing masking in RadTTS bottleneck layer (#5771)
borisfom Jan 12, 2023
e35d042
Update torchaudio dependency version for tutorials (#5781) (#5782)
github-actions[bot] Jan 12, 2023
db1fd2b
[TTS][ZH] bugfix import jieba errors. (#5776) (#5784)
github-actions[bot] Jan 12, 2023
aac3d97
fix typos
erastorgueva-nv Jan 12, 2023
3afb0e4
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 12, 2023
da67edd
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 12, 2023
58a461e
update requirements.txt
erastorgueva-nv Jan 12, 2023
656bdac
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 13, 2023
22d1a39
Make default devices None and set to GPU if it is available
erastorgueva-nv Jan 13, 2023
c8e692e
add warning for non-zero minimum_timestamp_duration
erastorgueva-nv Jan 13, 2023
1a298d5
Clarify phrasing in README regarding raising error if pred_text exists
erastorgueva-nv Jan 13, 2023
47acf80
Update README section on evaluating alignment accuracy
erastorgueva-nv Jan 13, 2023
211dc06
fix some code in creating segments
erastorgueva-nv Jan 13, 2023
00b38c4
Add some unit tests for NFA boundary_info creation
erastorgueva-nv Jan 13, 2023
c99750c
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 13, 2023
6d9c683
Added test for function adding t_start and t_end
erastorgueva-nv Jan 13, 2023
dd516e9
add comments to get_y_and_boundary_info_for_utt and remove redundant …
erastorgueva-nv Jan 13, 2023
55731fd
add comments to get_batch_tensors_and_boundary_info
erastorgueva-nv Jan 13, 2023
7f613ec
Add comments to make_output_files.py
erastorgueva-nv Jan 13, 2023
989e273
add comments to viterbi decoding code
erastorgueva-nv Jan 13, 2023
fd7b339
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 13, 2023
4896c17
Merge branch 'main' into nemo_forced_aligner
erastorgueva-nv Jan 17, 2023
e11258d
Add copyright headers
erastorgueva-nv Jan 17, 2023
3bd71be
Change req to nemo_toolkit[all]
erastorgueva-nv Jan 17, 2023
84 changes: 84 additions & 0 deletions tools/nemo_forced_aligner/README.md
@@ -0,0 +1,84 @@
# NeMo Forced Aligner (NFA)

A tool for forced alignment using Viterbi decoding of the log-probabilities produced by NeMo CTC-based ASR models.

## Usage example

``` bash
python <path_to_NeMo>/tools/nemo_forced_aligner/align.py \
pretrained_name="stt_en_citrinet_1024_gamma_0_25" \
model_downsample_factor=8 \
manifest_filepath=<path to manifest of utterances you want to align> \
output_dir=<path to where your ctm files will be saved>
```

## How do I use NeMo Forced Aligner?
To use NFA, all you need to provide is a correct NeMo manifest (with `"audio_filepath"` and `"text"` fields).

Call the `align.py` script, specifying the parameters as follows:

* `pretrained_name`: string specifying the name of a CTC NeMo ASR model which will be automatically downloaded from NGC and used to generate the log-probs for the alignment. Any QuartzNet, Citrinet, or Conformer CTC model should work, in any language (only English has been tested so far). If `model_path` is specified, `pretrained_name` must not be specified.
>Note: NFA can only use CTC models (not Transducer models) at the moment. If you want to transcribe a long audio file (longer than ~5-10 mins), do not use a Conformer CTC model, as it will likely give Out Of Memory errors.

* `model_path`: string specifying the local filepath to a CTC NeMo ASR model which will be used to generate the log-probs for the alignment. If `pretrained_name` is specified, `model_path` must not be specified.
>Note: NFA can only use CTC models (not Transducer models) at the moment. If you want to transcribe a long audio file (longer than ~5-10 mins), do not use a Conformer CTC model, as it will likely give Out Of Memory errors.

* `model_downsample_factor`: the downsample factor of the ASR model. It should be 2 if your model is QuartzNet, 4 if it is Conformer CTC, or 8 if it is Citrinet.

* `manifest_filepath`: The path to the manifest of the data you want to align, containing `'audio_filepath'` and `'text'` fields. The audio filepaths need to be absolute paths.

* `output_dir`: The folder in which to save the CTM files containing the generated alignments, plus a new JSON manifest containing the paths to those CTM files. There will be one CTM file per utterance (i.e. one CTM file per line in the manifest). The files will be called `<output_dir>/{tokens,words,additional_segments}/<utt_id>.ctm`, and each line in each file will start with `<utt_id>`. By default, `utt_id` will be the stem of the `audio_filepath`. This can be changed by overriding `audio_filepath_parts_in_utt_id`. The new JSON manifest will be at `<output_dir>/<original manifest file name>_with_ctm_paths.json`.

* **[OPTIONAL]** `align_using_pred_text`: if True, the audio will be transcribed with the ASR model (specified by `pretrained_name` or `model_path`) and that transcription will be used as the 'ground truth' for the forced alignment. The `"pred_text"` will be saved in the output JSON manifest at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. To avoid overwriting other transcribed texts, if there are already `"pred_text"` entries in the original manifest, the program will exit without attempting to generate alignments. (Default: False).

* **[OPTIONAL]** `transcribe_device`: The device that will be used for generating log-probs (i.e. transcribing). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). If specified, `transcribe_device` needs to be a string that can be input to the `torch.device()` method. (Default: `None`).

* **[OPTIONAL]** `viterbi_device`: The device that will be used for doing Viterbi decoding. If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). If specified, `viterbi_device` needs to be a string that can be input to the `torch.device()` method. (Default: `None`).

* **[OPTIONAL]** `batch_size`: The batch_size that will be used for generating log-probs and doing Viterbi decoding. (Default: 1).

* **[OPTIONAL]** `additional_ctm_grouping_separator`: the string used to separate CTM segments if you want to obtain CTM files at a level that is not the token level or the word level. NFA will always produce token-level and word-level CTM files, in `<output_dir>/tokens/<utt_id>.ctm` and `<output_dir>/words/<utt_id>.ctm`. If `additional_ctm_grouping_separator` is specified, an additional folder `<output_dir>/additional_segments/` will be created containing CTMs for `additional_ctm_grouping_separator`-separated segments. (Default: `None`. Cannot be an empty string or a space (" "), as space-separated word-level CTMs will always be saved in `<output_dir>/words/<utt_id>.ctm`.)
> Note: the `additional_ctm_grouping_separator` will be removed from the ground truth text and from all the output CTMs, i.e. it is treated as a marker which is not part of the ground truth. The separator is essentially treated as a space, and any additional spaces around it are amalgamated into one: if `additional_ctm_grouping_separator="|"`, the following texts are treated equivalently: `"abc|def"`, `"abc |def"`, `"abc| def"`, `"abc | def"`.

* **[OPTIONAL]** `remove_blank_tokens_from_ctm`: a boolean denoting whether to remove `<blank>` tokens from token-level output CTMs. (Default: False).

* **[OPTIONAL]** `audio_filepath_parts_in_utt_id`: This specifies how many of the 'parts' of the audio_filepath we will use (starting from the final part of the audio_filepath) to determine the utt_id that will be used in the CTM files. (Default: 1, i.e. utt_id will be the stem of the basename of audio_filepath). Note also that any spaces that are present in the audio_filepath will be replaced with dashes, so as not to change the number of space-separated elements in the CTM files.

* **[OPTIONAL]** `minimum_timestamp_duration`: a float indicating a minimum duration (in seconds) for timestamps in the CTM. If any line in the CTM has a duration lower than `minimum_timestamp_duration`, it will be enlarged from the middle outwards until it meets the `minimum_timestamp_duration`, or reaches the beginning or end of the audio file. Note that this may cause timestamps to overlap. (Default: 0, i.e. no modifications to the predicted durations).
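The enlarge-from-the-middle behaviour of `minimum_timestamp_duration` can be sketched as follows. This is a minimal illustration of the assumed logic, not NFA's actual implementation; times are in seconds:

```python
def enforce_min_duration(start, dur, min_dur, audio_dur):
    """Grow a too-short timestamp from its midpoint outwards,
    clamped to [0, audio_dur]. Sketch of the assumed behaviour."""
    if dur >= min_dur:
        return start, dur
    mid = start + dur / 2
    # Try to centre a min_dur-long window on the midpoint...
    new_start = max(0.0, mid - min_dur / 2)
    new_end = min(audio_dur, new_start + min_dur)
    # ...then shift back if we ran past the end of the audio.
    new_start = max(0.0, new_end - min_dur)
    return new_start, new_end - new_start
```

A timestamp near the start or end of the file is pushed inwards rather than clipped short, which is why overlaps with neighbouring timestamps can occur.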

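As an illustration of the `audio_filepath_parts_in_utt_id` option above, a hypothetical helper might derive `utt_id` like this (the dash used to join path parts is an assumption for illustration, not necessarily NFA's exact choice):

```python
import os

def get_utt_id(audio_filepath, n_parts=1):
    """Derive an utt_id from the last n_parts components of a
    POSIX-style audio filepath (illustrative sketch)."""
    parts = audio_filepath.split("/")[-n_parts:]
    # Drop the extension from the audio file itself
    parts[-1] = os.path.splitext(parts[-1])[0]
    # Replace spaces with dashes so CTM columns stay space-separated
    return "-".join(parts).replace(" ", "-")
```

For example, with the default `n_parts=1`, `/data/my dir/audio file.wav` would yield `audio-file`; with `n_parts=2` it would yield `my-dir-audio-file`.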
## Input manifest file format
By default, NFA needs to be provided with a 'manifest' file where each line specifies the absolute "audio_filepath" and "text" of each utterance that you wish to produce alignments for, like the format below:
```json
{"audio_filepath": "/absolute/path/to/audio.wav", "text": "the transcription of the utterance"}
```

You can omit the `"text"` field from the manifest if you specify `align_using_pred_text=true`. In that case, any `"text"` fields in the manifest will be ignored: the ASR model at `pretrained_name` or `model_path` will be used to transcribe the audio and obtain `"pred_text"`, which will be used as the 'ground truth' for the forced alignment process. The `"pred_text"` will also be saved in the output manifest JSON file at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. To remove the possibility of overwriting `"pred_text"`, NFA will raise an error if `align_using_pred_text=true` and there are existing `"pred_text"` fields in the original manifest.

> Note: NFA does not require `"duration"` fields in the manifest, and can align long audio files without running out of memory. Depending on your machine specs, you can align audios up to 5-10 minutes on Conformer CTC models, up to around 1.5 hours for QuartzNet models, and up to several hours for Citrinet models. NFA will also produce better alignments the more accurate the ground-truth `"text"` is.
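A manifest in this format can be produced with a few lines of Python. The helper below is hypothetical (NFA itself only reads such a file); it resolves each audio path to absolute form, as NFA requires:

```python
import json
from pathlib import Path

def write_manifest(pairs, manifest_path):
    """Write (audio_filepath, text) pairs as an NFA input manifest,
    one JSON object per line. Hypothetical convenience helper."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, text in pairs:
            entry = {
                "audio_filepath": str(Path(audio_filepath).resolve()),
                "text": text,
            }
            f.write(json.dumps(entry) + "\n")
```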


## Output CTM file format
For each utterance specified in a line of `manifest_filepath`, several CTM files will be generated:
* a CTM file containing token-level alignments at `<output_dir>/tokens/<utt_id>.ctm`,
* a CTM file containing word-level alignments at `<output_dir>/words/<utt_id>.ctm`,
* if `additional_ctm_grouping_separator` is specified, a CTM file containing segment-level alignments at `<output_dir>/additional_segments/<utt_id>.ctm`.

Each CTM file will contain lines of the format:
`<utt_id> 1 <start time in samples> <duration in samples> <text, i.e. token/word/segment>`.
Note that the second item in the line (the 'channel ID', which is required by the CTM file format) is always 1, as NFA operates on single-channel audio.
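A minimal sketch of parsing one such CTM line back into its five fields (a hypothetical helper, not part of NFA; start and duration are kept as floats here, with units as described above):

```python
def parse_ctm_line(line):
    """Split one CTM line into (utt_id, channel, start, duration, text).
    maxsplit=4 keeps any spaces inside the text field intact."""
    utt_id, channel, start, dur, text = line.strip().split(" ", 4)
    return utt_id, int(channel), float(start), float(dur), text
```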

## Output JSON manifest file format
A new manifest file will be saved at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. It will contain the same fields as the original manifest, and additionally:
* `"token_level_ctm_filepath"`
* `"word_level_ctm_filepath"`
* `"additonal_segment_level_ctm_filepath"` (if `additional_ctm_grouping_separator` is specified)
* `"pred_text"` (if `align_using_pred_text=true`)


## How do I evaluate the alignment accuracy?
Ideally you would have some 'true' CTM files to compare with your generated CTM files. With these you could obtain metrics such as the mean (absolute) error between the predicted and 'true' start/end times of the segments.
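As a sketch of such a metric, the hypothetical helper below computes the mean absolute boundary error between two aligned lists of (start, end) pairs (assumed to cover the same segments in the same order, in the same units):

```python
def mean_abs_boundary_error(pred, ref):
    """Mean absolute error over all start and end boundaries.
    pred and ref are equal-length lists of (start, end) pairs."""
    assert len(pred) == len(ref), "need one reference pair per prediction"
    total = sum(
        abs(p_start - r_start) + abs(p_end - r_end)
        for (p_start, p_end), (r_start, r_end) in zip(pred, ref)
    )
    # Two boundaries (start and end) per segment
    return total / (2 * len(pred))
```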

Alternatively (or additionally), you can inspect alignment quality visually using a tool such as Gecko, which plays your audio file while displaying the predicted alignments. Gecko requires you to upload an audio file and at least one CTM file. It can be accessed here: https://gong-io.github.io/gecko/, and more information can be found on its GitHub page: https://github.com/gong-io/gecko.

**Note**: the following may help improve your experience viewing the CTMs in Gecko:
* setting `minimum_timestamp_duration` to a larger number, as Gecko may not display some tokens/words/segments properly if their timestamps are too short.
* setting `remove_blank_tokens_from_ctm=true` if you are analyzing token-level CTMs, as it will make the Gecko visualization less cluttered.