From f17c37dba1f02b62fc8b0ecd298803a88b133b73 Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Fri, 28 Oct 2022 10:47:37 -0700
Subject: [PATCH 01/16] Add details to SDP README.md

Signed-off-by: Elena Rastorgueva

---
 tools/speech_dataset_processor/README.md | 164 ++++++++++++++++++++++-
 1 file changed, 161 insertions(+), 3 deletions(-)

diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index 992c1da656e5..a7494c3b3af2 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -1,7 +1,165 @@
 # Speech Dataset Processor

-Toolkit to make it easy to write and share the steps for processing a speech dataset.
+Speech Dataset Processor (SDP) is a toolkit to make it easy to:
+1. write code to process a new dataset, minimizing the amount of boilerplate code required.
+2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file.

-This toolkit contains many of the most common speech dataset processing operations. To process a new dataset, you simply need to write a YAML file containing the parameters needed for dataset processing. It is also easy to add your own code for various speech dataset processing steps if needed.
+SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you will need to do to process a new dataset is simply to write a YAML file containing the parameters needed to process your dataset.

-TBD
+SDP is specifically intended for the use case when you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' `"text"` or dropping utterances which are deemed to be too inaccurate for training on.

# Overview of how SDP processes a dataset
1. You call the main.py script, passing in a config.yaml file, possibly with some overrides.
2. The main.py script calls run_processors.py, passing in your config.
3. run_processors.py does the following:

   a. picks out the processors you wish to run (you can specify a subset of the processors in the config override, e.g. to avoid re-running time-consuming steps).

   b. if some of the processors have not had "output_manifest_file" or "input_manifest_file" entries specified, SDP will automatically create temporary files for those.

   c. instantiates the processor classes using `hydra.utils.instantiate`.

   d. runs the run-time processor tests by calling the `processor.test()` method.

   e. runs the processing method (`processor.process()`) of each processor in order.


# Layout of config YAML files

The YAML config file for processing a dataset must contain a key `processors`, the value of which is a list. Each item in that list is expected to be a dictionary specifying a processor class, i.e. it must have a key `_target_`, the value of which is a path to a "processor" class, and the remaining keys must be the kwargs necessary to instantiate that class with `hydra.utils.instantiate()` (cf. https://hydra.cc/docs/advanced/instantiate_objects/overview/).

SDP will run the processors specified in the `processors` list in the config file. It will also check for a `processors_to_run` key in the config file, which can be either the string "all", or any Python "slice" object.
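For illustration, a config file following this layout might look like the sketch below. The processor shown is one of SDP's real classes, but the argument values here are placeholders rather than working paths:

```yaml
processors_to_run: all
processors:
  # each list item instantiates one processor class via its _target_ path
  - _target_: sdp.processors.CreateInitialManifestMLS
    output_manifest_file: /path/to/initial_manifest.json
    language: spanish
    download_dir: /path/to/download_dir
    resampled_audio_dir: /path/to/resampled_audio_dir
    data_split: train
```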
+

> **Note**: SDP will run the processors in the order in which they are listed in the config YAML file. Make sure to list the processors in an order which makes sense, e.g. create an initial manifest first; make sure to run ASR inference before doing any processing which looks at `pred_text` fields in the manifest.

# Processor classes

## `Base Processor`
All processor classes inherit from the `BaseProcessor` class. This is a very simple abstract class which has 2 empty methods: `process()` and `test()`. These serve to remind us that SDP essentially just runs `test()` on all processors, and then `process()` on all processors.

`ASRInference` is a child class of `BaseProcessor`. It has a simple `process()` method which runs transcription on every utterance in the input_manifest.

`WriteManifest` is also a child class of `BaseProcessor`. It has a simple `process()` method which saves a copy of the input manifest containing only the fields specified in `fields_to_save`.

## `BaseParallelProcessor`
`BaseParallelProcessor` inherits from the `BaseProcessor` class. Within its `.process()` method, it calls other methods and functions, which allow it to do more complex processing. Most importantly, it calls its `.process_dataset_entry(data_entry)` method on every utterance in the manifest, and it does this in parallel, allowing for more efficient processing.

### What is a `DataEntry`?
As mentioned above, `BaseParallelProcessor.process_dataset_entry(data_entry)` is called on a variable called `data_entry` which represents an utterance in our dataset. In most cases, `data_entry` is a `DataEntry` object, which represents a line in a dataset manifest.
> The only exception to the above is in processors which are run at the start of processing when we are creating a manifest for the first time (such as `CreateInitialManifestMLS`, in which the `data_entry` variable is a string containing a line for that utterance from the original raw MLS transcript).

The `DataEntry` class is a dataclass which contains 2 attributes:
1. `data` is an Optional dictionary containing items which represent the JSON manifest entry. `data` can also be `None`. If a `.process_dataset_entry(data_entry)` method returns a `DataEntry` class where `data == None`, then that utterance will be dropped from the output manifest.
2. `metrics`, which can be of any type, and are `None` by default. This variable is used by some processors to record summary statistics about the changes made to the dataset; these metrics are aggregated and can be displayed once every utterance has been processed by the processor.

### What happens in `BaseParallelProcessor.process()`

We outline the `.process()` method of the `BaseParallelProcessor` class below:

```mermaid
graph TD;
  subgraph "Steps in BaseParallelProcessor.process() method"
    A["self.prepare()
empty method by default
can be used to
e.g. download raw data automatically"] --> B + B["self.read_manifest()
reads input manifest
(i.e. from previous processing step)
if there is one"] --> C + C["self.process_dataset_entry(data_entry)
abstract method"] --> D + D["save output manifest
& aggregate metrics
(e.g. # of utts dropped)"] --> E
    E["self.finalize(metrics)
e.g. display metrics from processing"]

  end

```

## `ModifyManifestTextProcessor`
`ModifyManifestTextProcessor` inherits from the `BaseParallelProcessor` class. It takes in an additional optional parameter `test_cases` and overrides a few methods:
* `.test()`: this method makes sure that the output from the processor matches the expected output specified in the `test_cases` parameter.
* `.process_dataset_entry(data_entry)`: this method applies processing to a `data_entry`. First, spaces are added to the start and end of the 'text' and 'pred_text' entries (if they exist), then the abstract method `._process_dataset_entry(data_entry)` is called. Then, any extra spaces (e.g. two spaces next to each other '  ') are removed from 'text' and 'pred_text' entries.
* `._process_dataset_entry(data_entry)`: this is an abstract method which will be overridden by children of `ModifyManifestTextProcessor`.


## How to make your own processor classes.

We will describe how to make your own processor classes by referring to SDP's existing classes.

### Creating an initial manifest, e.g. as in `CreateInitialManifestMLS`.
`CreateInitialManifestMLS` is a child class of `BaseParallelProcessor`. It downloads raw MLS data for a specified language, and creates an initial manifest (in the format expected by NeMo) which can be cleaned by subsequent processors.

Its `.prepare()` method downloads and extracts the raw data.

Its `read_manifest()` method reads the lines in the raw MLS transcript file.

Its `process_dataset_entry()` method takes in the lines from the raw MLS transcript file, and outputs `DataEntry` objects containing entries that will be saved into the manifest (i.e. `"audio_filepath"`, `"duration"`, `"text"`) for each utterance.


### A `ModifyManifestTextProcessor` class that cleans ground truth text, e.g. as in `SubSubstringToSpace`.

One of the classes provided in SDP is `SubSubstringToSpace`. At initialization, it takes in `substrings`, a list of strings which, if found in the "text", will be converted to spaces. This is helpful for e.g. removing punctuation.

In its `_process_dataset_entry(data_entry)` method it performs the string-to-space conversion on the `data_entry` that is input. Its output is a `data_entry` with the changes applied to `data`, and the metrics of which substrings were spotted and converted to spaces recorded in `metrics`. These metrics will be aggregated over all utterances by the `BaseParallelProcessor` class. `SubSubstringToSpace` also has a `.finalize(metrics)` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

### A `ModifyManifestTextProcessor` class that drops incorrectly transcribed utterances, e.g. as in `DropHighLowCharrate`.

One of the classes provided in SDP is `DropHighLowCharrate`. At initialization, it takes in `high_charrate_threshold` and `low_charrate_threshold`; an utterance will be dropped if its character rate is above the former or below the latter. This is helpful for automatically filtering out incorrectly transcribed utterances.

In its `_process_dataset_entry(data_entry)` method it computes the character rate of the utterance. If the character rate is within bounds, it will return the same `data_entry` that was input. If the character rate is out of bounds, it will return a `data_entry` with `data=None` and `metrics` which reflect the applied changes.
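To make the drop logic concrete, a minimal sketch of such a `_process_dataset_entry` method is shown below. This is a simplified illustration rather than the exact `DropHighLowCharrate` implementation (for example, the real class records more detailed metrics):

```python
from sdp.processors.base_processor import DataEntry

def _process_dataset_entry(self, data_entry):
    # character rate = number of characters in the ground truth text
    # per second of audio
    charrate = len(data_entry["text"]) / data_entry["duration"]
    if self.low_charrate_threshold <= charrate <= self.high_charrate_threshold:
        # character rate is within bounds => keep the utterance unchanged
        return DataEntry(data=data_entry, metrics=0)
    # out of bounds => data=None means the utterance is dropped from the
    # output manifest; metrics=1 can be aggregated to count dropped utterances
    return DataEntry(data=None, metrics=1)
```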
Similar to the `SubSubstringToSpace` class, it has a `.finalize(metrics)` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

## Class diagram
A diagram of the classes mentioned above is included here. Arrows represent inheritance.
```mermaid
classDiagram
BaseProcessor <|-- BaseParallelProcessor
BaseProcessor <|-- ASRInference
BaseProcessor <|-- WriteManifest

BaseParallelProcessor <|-- CreateInitialManifestMLS
BaseParallelProcessor <|-- ModifyManifestTextProcessor

ModifyManifestTextProcessor <|-- SubSubstringToSpace
ModifyManifestTextProcessor <|-- DropHighLowCharrate

class BaseProcessor{
    output_manifest_file
    input_manifest_file
    process()
    test()
}
class BaseParallelProcessor{
    process()
    prepare()
    read_manifest()
    process_dataset_entry()
    finalize()
}
class ASRInference{
    pretrained_model
    batch_size
    process()
}
class WriteManifest{
    fields_to_save
    process()
}
class CreateInitialManifestMLS{
    ...
    ....()
}
class ModifyManifestTextProcessor{
    test_cases
    test()
    process_dataset_entry(data_entry)
    _process_dataset_entry(data_entry)
}
class SubSubstringToSpace{
    substrings
    _process_dataset_entry(data_entry)
    finalize(metrics)
}
class DropHighLowCharrate{
    high_charrate_threshold
    low_charrate_threshold
    _process_dataset_entry(data_entry)
    finalize(metrics)
}
```

From 7c67407f7749119fcfe943f71fe970d4eb3d3a03 Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Fri, 28 Oct 2022 11:50:46 -0700
Subject: [PATCH 02/16] Add docstring to WriteManifest processor

Signed-off-by: Elena Rastorgueva

---
 .../sdp/processors/write_manifest.py | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/speech_dataset_processor/sdp/processors/write_manifest.py b/tools/speech_dataset_processor/sdp/processors/write_manifest.py
index 1f2d3ef12f2b..208501b19661 100644
--- a/tools/speech_dataset_processor/sdp/processors/write_manifest.py
+++ b/tools/speech_dataset_processor/sdp/processors/write_manifest.py
@@ -13,13 +13,23 @@
 # limitations under the License.

 import json
+from typing import List

 from sdp.processors.base_processor import BaseProcessor
 from tqdm import tqdm


 class WriteManifest(BaseProcessor):
-    def __init__(self, output_manifest_file, input_manifest_file, fields_to_save):
+    """
+    Saves a copy of a manifest but only with the fields specified in fields_to_save.
+
+    Args:
+        output_manifest_file: path of where the output file will be saved.
+        input_manifest_file: path of where the input file that we will be copying is saved.
+        fields_to_save: list of the fields in the input manifest that we want to copy over.
+            The output file will only contain these fields.
+ """ + def __init__(self, output_manifest_file: str, input_manifest_file: str, fields_to_save: List[str]): self.output_manifest_file = output_manifest_file self.input_manifest_file = input_manifest_file self.fields_to_save = fields_to_save From 2d346f29800feb8bb4b9f84f4cdbf8cc435a9d32 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Fri, 28 Oct 2022 13:54:29 -0700 Subject: [PATCH 03/16] Add docstring to CreateInitialManifestMLS Signed-off-by: Elena Rastorgueva --- .../create_initial_manifest_mls.py | 23 +++++++++++++++++-- .../sdp/processors/write_manifest.py | 1 + 2 files changed, 22 insertions(+), 2 deletions(-) diff --git a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py index 1ff9e914fe1b..97f224cb69de 100644 --- a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py +++ b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py @@ -25,8 +25,27 @@ class CreateInitialManifestMLS(BaseParallelProcessor): + """ + Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using + the transcripts provided in the raw data. + + Args: + language: the language of the data you wish to be downloaded. This will be used to format the + URL from which we attempt to download the data. + download_dir: the directory where the downloaded data will be saved. + data_split: the data split for which the initial manifest will be created. + resampled_audio_dir: the directory where the resampled (16kHz) wav files will be stored. + use_test_data: if `True`, will use the test data manifest located at `TEST_DATA_PATH` to carry out tests. + """ + def __init__( - self, language, download_dir, resampled_audio_dir, data_split, use_test_data=False, **kwargs, + self, + language: str, + download_dir: str, + resampled_audio_dir: str, + data_split: str, + use_test_data: bool = False, + **kwargs, ): super().__init__(**kwargs) self.language = language @@ -65,7 +84,7 @@ def read_manifest(self): return dataset_entries - def process_dataset_entry(self, data_entry): + def process_dataset_entry(self, data_entry: str): if len(data_entry.split("\t")) != 2: raise RuntimeError(f"have more than one tab in line {data_entry}") diff --git a/tools/speech_dataset_processor/sdp/processors/write_manifest.py b/tools/speech_dataset_processor/sdp/processors/write_manifest.py index 208501b19661..f601985a1647 100644 --- a/tools/speech_dataset_processor/sdp/processors/write_manifest.py +++ b/tools/speech_dataset_processor/sdp/processors/write_manifest.py @@ -29,6 +29,7 @@ class WriteManifest(BaseProcessor): fields_to_save: list of the fields in the input manifest that we want to copy over. The output file will only contain these fields. 
""" + def __init__(self, output_manifest_file: str, input_manifest_file: str, fields_to_save: List[str]): self.output_manifest_file = output_manifest_file self.input_manifest_file = input_manifest_file From 8bf8dc83a7f31ded8cbd1b1cc7c173fe01f07714 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Fri, 28 Oct 2022 14:25:22 -0700 Subject: [PATCH 04/16] Add ModifyManifestTextProcessor docstring Signed-off-by: Elena Rastorgueva --- .../modify_manifest/modify_manifest.py | 20 +++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py index 5c1c0d808848..192c99a0191b 100644 --- a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py +++ b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py @@ -14,21 +14,29 @@ from abc import abstractmethod -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Type -from sdp.processors.base_processor import BaseParallelProcessor +from sdp.processors.base_processor import BaseParallelProcessor, DataEntry from sdp.utils.edit_spaces import add_start_end_spaces, remove_extra_spaces class ModifyManifestTextProcessor(BaseParallelProcessor): """Base class useful for most "text-only" modifications of the manifest. - Will add the following functionality: - - Add space in the beginning and end of sentence for easier regex-based + This adds the following functionality on top of BaseParallelProcessor + - Adds space in the beginning and end of sentence for easier regex-based processing. - Automatically handles common test cases by comparing input to output values. + Args: + test_cases: an optional list of dicts containing test cases for checking + that the processor makes the changes that we are expecting. + The dicts must have a key 'input', the value of which is a dictionary + containing data which is our test input manifest line, and a key + 'output', the value of which is a dictionary containing data which is + the expected output manifest line. + .. note:: This class only supports one-to-one or one-to-none mappings. """ @@ -54,10 +62,10 @@ def test(self): ) @abstractmethod - def _process_dataset_entry(self, data_entry): + def _process_dataset_entry(self, data_entry: Type[DataEntry]): pass - def process_dataset_entry(self, data_entry): + def process_dataset_entry(self, data_entry: Type[DataEntry]): """Wrapper for 'process_dataset_entry' abstract method. Before 'process_dataset_entry' is called, the function From b1ea49d0fea845dad9979e10374557b1b79d0802 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Fri, 28 Oct 2022 14:35:01 -0700 Subject: [PATCH 05/16] Add ASRInference docstring Signed-off-by: Elena Rastorgueva --- .../sdp/processors/asr_inference.py | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/tools/speech_dataset_processor/sdp/processors/asr_inference.py b/tools/speech_dataset_processor/sdp/processors/asr_inference.py index 6ace462d7e39..98bf43b45b90 100644 --- a/tools/speech_dataset_processor/sdp/processors/asr_inference.py +++ b/tools/speech_dataset_processor/sdp/processors/asr_inference.py @@ -20,7 +20,14 @@ class ASRInference(BaseProcessor): - """This processor perforce ASR inference. + """This processor performs ASR inference on the input manifest. + + Args: + output_manifest: the path to the output manifest. 
It will be the same as the input manifest, but will
+            also have "pred_text" entries for every utterance.
+        input_manifest_file: the path to the input manifest which will be transcribed.
+        pretrained_model: the name of the pretrained NeMo ASR model which will be used to do inference.
+        batch_size: the batch size to use for ASR inference.

     Note that it does not re-use base parallel implementation, since the ASR
     inference is already run in batches.
@@ -29,7 +36,9 @@
     parallelization, but that needs to be tested.
     """

-    def __init__(self, output_manifest_file, input_manifest_file, pretrained_model, batch_size=32):
+    def __init__(
+        self, output_manifest_file: str, input_manifest_file: str, pretrained_model: str, batch_size: int = 32
+    ):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
         self.script_path = Path(__file__).parents[4] / "examples" / "asr" / "transcribe_speech.py"

From eb215203b9ed318bb281a3a36afb4b87daf642a9 Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Fri, 28 Oct 2022 15:05:42 -0700
Subject: [PATCH 06/16] Add base_processor docstrings

Signed-off-by: Elena Rastorgueva

---
 .../sdp/processors/base_processor.py | 21 +++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/tools/speech_dataset_processor/sdp/processors/base_processor.py b/tools/speech_dataset_processor/sdp/processors/base_processor.py
index a51b3de1178b..2bbad5da6484 100644
--- a/tools/speech_dataset_processor/sdp/processors/base_processor.py
+++ b/tools/speech_dataset_processor/sdp/processors/base_processor.py
@@ -34,6 +34,17 @@ class DataEntry:


 class BaseProcessor(ABC):
+    """
+    Abstract class for SDP processors.
+
+    Args:
+        output_manifest_file: path of where the output manifest file will be located.
+        input_manifest_file: path of where the input manifest file is located. This arg
+            is optional - some processors may not take in an input manifest because they
+            need to create an initial manifest from scratch (i.e. from some transcript file
+            that is in a format different to the NeMo manifest format).
+    """
+
     def __init__(self, output_manifest_file, input_manifest_file=None):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
@@ -55,13 +66,15 @@ def test(self):


 class BaseParallelProcessor(BaseProcessor):
     """
-    TBD
+    Processor class which allows operations on each utterance to be parallelized. Parallelization
+    is done using tqdm.contrib.concurrent.process_map.

-    input_manifest_file should always be specified unless it's the first
-    processor that reads from original dataset representation.
+    Args:
+        max_workers: maximum number of workers that will be spawned during parallel processing.
+        chunksize: the size of the chunks that will be sent to worker processes.
""" - def __init__(self, max_workers=-1, chunksize=100, **kwargs): + def __init__(self, max_workers: int = -1, chunksize: int = 100, **kwargs): super().__init__(**kwargs) if max_workers == -1: max_workers = multiprocessing.cpu_count() From b6d3ef53feff529c1ac43e49606cfe4ffd020b40 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Fri, 28 Oct 2022 15:32:19 -0700 Subject: [PATCH 07/16] Add minimal SDP docs page Signed-off-by: Elena Rastorgueva --- docs/source/tools/intro.rst | 1 + docs/source/tools/speech_dataset_processor.rst | 8 ++++++++ 2 files changed, 9 insertions(+) create mode 100644 docs/source/tools/speech_dataset_processor.rst diff --git a/docs/source/tools/intro.rst b/docs/source/tools/intro.rst index 2a0a062040ec..18a26cf03652 100644 --- a/docs/source/tools/intro.rst +++ b/docs/source/tools/intro.rst @@ -9,5 +9,6 @@ NeMo provides a set of tools useful for developing Automatic Speech Recognitions ctc_segmentation speech_data_explorer + speech_dataset_processor diff --git a/docs/source/tools/speech_dataset_processor.rst b/docs/source/tools/speech_dataset_processor.rst new file mode 100644 index 000000000000..31683c31231c --- /dev/null +++ b/docs/source/tools/speech_dataset_processor.rst @@ -0,0 +1,8 @@ +Speech Dataset Processor +======================== + +Speech Dataset Processor (SDP) is a toolkit to make it easy to: +1. write code to process a new dataset, minimizing the amount of boilerplate code required. +2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file. + +More information about it can be found `here `_. From 96a2fb0b1ca7aed4c879b0899c85fad4bac75207 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Date: Tue, 8 Nov 2022 10:54:32 -0800 Subject: [PATCH 08/16] Update tools/speech_dataset_processor/README.md Co-authored-by: Igor Gitman Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> --- tools/speech_dataset_processor/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md index a7494c3b3af2..377ff1a36148 100644 --- a/tools/speech_dataset_processor/README.md +++ b/tools/speech_dataset_processor/README.md @@ -49,7 +49,7 @@ As mentioned above, `BaseParallelProcessor.process_dataset_entry(data_entry)` is > The only exception to the above is in processors which are run at the start of processing when we are creating a manifest for the first time (such as `CreateInitialManifestMLS`, in which the `data_entry` variable is a string containing a line for that utterance from the original raw MLS transcript). The `DataEntry` class is a dataclass which contains 2 attributes: -1. `data` is an Optional dictionary containing items which represent the JSON manifest entry. `data` can also be `None`. If a `.process_dataset_entry(data_entry)` method returns a `DataEntry` class where `data == None`, then that utterance will be dropped from the output manifest. +1. `data` is an Optional dictionary containing items which represent the JSON manifest entry. `data` can also be `None`. If a `.process_dataset_entry(data_entry)` method returns a `DataEntry` class where `data is None`, then that utterance will be dropped from the output manifest. 2. `metrics`, which can be of any type, and are `None` by default. 
This variable is used by some processors to record summary statistics about the changes made to the dataset; these metrics are aggregated and can be displayed once every utterance has been processed by the processor.

 ### What happens in `BaseParallelProcessor.process()`

From 0efbb2a1dee4df4386c2cea9a6830fb15b783345 Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Tue, 8 Nov 2022 16:39:24 -0800
Subject: [PATCH 09/16] Write simple README for SDP and move complex
 explanations to docs

Signed-off-by: Elena Rastorgueva

---
 .../source/tools/speech_dataset_processor.rst | 126 ++++++++++-
 tools/speech_dataset_processor/README.md      | 199 +++++-------------
 2 files changed, 176 insertions(+), 149 deletions(-)

diff --git a/docs/source/tools/speech_dataset_processor.rst b/docs/source/tools/speech_dataset_processor.rst
index 31683c31231c..f60fc32b431e 100644
--- a/docs/source/tools/speech_dataset_processor.rst
+++ b/docs/source/tools/speech_dataset_processor.rst
@@ -2,7 +2,127 @@ Speech Dataset Processor
 ========================

 Speech Dataset Processor (SDP) is a toolkit to make it easy to:
    1. write code to process a new dataset, minimizing the amount of boilerplate code required.
    2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file.

SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you will need to do to process a new dataset is simply to write a YAML file containing the parameters needed to process your dataset.

SDP is specifically intended for the use case when you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' ``"text"`` or dropping utterances which are deemed to be too inaccurate for training on.

Overview of how SDP processes a dataset
---------------------------------------

    1. You call the ``main.py`` script, passing in a YAML config file, possibly with some overrides.
    2. The ``main.py`` script calls ``run_processors.py``, passing in your config.
    3. ``run_processors.py`` does the following:

        a. picks out the processors you wish to run (you can specify a subset of the processors in the config override, e.g. to avoid re-running time-consuming steps).
        b. if some of the processors have not had "output_manifest_file" or "input_manifest_file" entries specified, SDP will automatically create temporary files for those.
        c. instantiates the processor classes using ``hydra.utils.instantiate``
        d. runs the run-time processor tests by calling the ``processor.test()`` method.
        e. runs the processing method (``processor.process()``) of each processor in order.


Layout of config YAML files
---------------------------

The YAML config file for processing a dataset must contain a key ``processors``, the value of which is a list. Each item in that list is expected to be a dictionary specifying a processor class, i.e.
it must have a key ``_target_``, the value of which is a path to a "processor" class, and the remaining keys must be the kwargs necessary to instantiate that class with ``hydra.utils.instantiate()`` (cf. https://hydra.cc/docs/advanced/instantiate_objects/overview/).

SDP will run the processors specified in the ``processors`` list in the config file. It will also check for a ``processors_to_run`` key in the config file, which can be either the string "all", or any Python "slice" object.

.. note::
    SDP will run the processors in the order in which they are listed in the config YAML file. Make sure to list the processors in an order which makes sense, e.g. create an initial manifest first; make sure to run ASR inference before doing any processing which looks at ``pred_text`` fields in the manifest.

Processor classes
-----------------


``Base Processor``
--------------------

All processor classes inherit from the ``BaseProcessor`` class. This is a very simple abstract class which has 2 empty methods: ``process()`` and ``test()``.
These serve to remind us that SDP essentially just runs ``test()`` on all processors, and then ``process()`` on all processors.

``ASRInference`` is a child class of ``BaseProcessor``. It has a simple ``process()`` method which runs transcription on every utterance in the input_manifest.

``WriteManifest`` is also a child class of ``BaseProcessor``. It has a simple ``process()`` method which saves a copy of the input manifest containing only the fields specified in ``fields_to_save``.

``BaseParallelProcessor``
---------------------------

``BaseParallelProcessor`` inherits from the ``BaseProcessor`` class. Within its ``.process()`` method, it calls other methods and functions, which allow it to do more complex processing.
Most importantly, it calls its ``.process_dataset_entry(data_entry)`` method on every utterance in the manifest, and it does this in parallel, allowing for more efficient processing.

What is a ``DataEntry``?
--------------------------

As mentioned above, ``BaseParallelProcessor.process_dataset_entry(data_entry)`` is called on a variable called ``data_entry`` which represents an utterance in our dataset.
Most often, ``data_entry`` will be a dictionary containing items which represent the JSON manifest entry.
Sometimes, such as in ``CreateInitialManifestMLS``, it will be a string containing a line for that utterance from the original raw MLS transcript.

``BaseParallelProcessor.process_dataset_entry`` will process ``data_entry`` and output a ``DataEntry`` object.

The ``DataEntry`` class is a dataclass which contains 2 attributes:

1. ``data`` is an Optional dictionary containing items which represent the JSON manifest entry. ``data`` can also be ``None``. If a ``.process_dataset_entry(data_entry)`` method returns a ``DataEntry`` class where ``data is None``, then that utterance will be dropped from the output manifest.
2. ``metrics``, which can be of any type, and are ``None`` by default. This variable is used by some processors to record summary statistics about the changes made to the dataset; these metrics are aggregated and can be displayed once every utterance has been processed by the processor.

What happens in ``BaseParallelProcessor.process()``
-----------------------------------------------------

We outline the ``.process()`` method of the ``BaseParallelProcessor`` class below:

.. raw:: html
+ +
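In rough code form, these steps amount to the following. This is an illustrative sketch based on the descriptions in this section, not the actual implementation (which handles more details and edge cases):

.. code-block:: python

    import json

    from tqdm.contrib.concurrent import process_map

    def process(self):
        # sketch of BaseParallelProcessor.process()
        self.prepare()  # empty by default; can be used to e.g. download raw data
        entries = self.read_manifest()  # reads the input manifest, if there is one
        # process every utterance in parallel; each call returns a DataEntry
        results = process_map(
            self.process_dataset_entry, entries, max_workers=self.max_workers, chunksize=self.chunksize,
        )
        with open(self.output_manifest_file, "w") as fout:
            for result in results:
                if result.data is not None:  # data=None means the utterance is dropped
                    fout.write(json.dumps(result.data) + "\n")
        self.finalize([result.metrics for result in results])  # e.g. display aggregated metrics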


``ModifyManifestTextProcessor``
---------------------------------

``ModifyManifestTextProcessor`` inherits from the ``BaseParallelProcessor`` class. It takes in an additional optional parameter ``test_cases`` and overrides a few methods:

* ``.test()``: this method makes sure that the output from the processor matches the expected output specified in the ``test_cases`` parameter.
* ``.process_dataset_entry(data_entry)``: this method applies processing to a ``data_entry``. First, spaces are added to the start and end of the 'text' and 'pred_text' entries (if they exist), then the abstract method ``._process_dataset_entry(data_entry)`` is called. Then, any extra spaces (e.g. two spaces next to each other '  ') are removed from 'text' and 'pred_text' entries.
* ``._process_dataset_entry(data_entry)``: this is an abstract method which will be overridden by children of ``ModifyManifestTextProcessor``.


How to make your own processor classes
--------------------------------------

We will describe how to make your own processor classes by referring to SDP's existing classes.

Creating an initial manifest, e.g. as in ``CreateInitialManifestMLS``
-----------------------------------------------------------------------

``CreateInitialManifestMLS`` is a child class of ``BaseParallelProcessor``. It downloads raw MLS data for a specified language, and creates an initial manifest (in the format expected by NeMo) which can be cleaned by subsequent processors.

Its ``.prepare()`` method downloads and extracts the raw data.

Its ``read_manifest()`` method reads the lines in the raw MLS transcript file.

Its ``process_dataset_entry()`` method takes in the lines from the raw MLS transcript file, and outputs ``DataEntry`` objects containing entries that will be saved into the manifest (i.e. ``"audio_filepath"``, ``"duration"``, ``"text"``) for each utterance.


A ``ModifyManifestTextProcessor`` class that cleans ground truth text, e.g. as in ``SubSubstringToSpace``
-----------------------------------------------------------------------------------------------------------

One of the classes provided in SDP is ``SubSubstringToSpace``. At initialization, it takes in ``substrings``, a list of strings which, if found in the "text", will be converted to spaces. This is helpful for e.g. removing punctuation.

In its ``_process_dataset_entry(data_entry)`` method it performs the string-to-space conversion on the ``data_entry`` that is input. Its output is a ``data_entry`` with the changes applied to ``data``, and the metrics of which substrings were spotted and converted to spaces recorded in ``metrics``. These metrics will be aggregated over all utterances by the ``BaseParallelProcessor`` class. ``SubSubstringToSpace`` also has a ``.finalize(metrics)`` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

A ``ModifyManifestTextProcessor`` class that drops incorrectly transcribed utterances, e.g. as in ``DropHighLowCharrate``
----------------------------------------------------------------------------------------------------------------------------

One of the classes provided in SDP is ``DropHighLowCharrate``. At initialization, it takes in ``high_charrate_threshold`` and ``low_charrate_threshold``; an utterance will be dropped if its character rate is above the former or below the latter. This is helpful for automatically filtering out incorrectly transcribed utterances.
In its ``_process_dataset_entry(data_entry)`` method it computes the character rate of the utterance. If the character rate is within bounds, it will return the same ``data_entry`` that was input. If the character rate is out of bounds, it will return a ``data_entry`` with ``data=None`` and ``metrics`` which reflect the applied changes.
Similar to the ``SubSubstringToSpace`` class, it has a ``.finalize(metrics)`` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

Class diagram
-------------
A diagram of the classes mentioned above is included here. Arrows represent inheritance.


.. raw:: html
+ +
diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index 377ff1a36148..777ae9ad5ead 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -8,158 +8,65 @@ SDP's philosophy is to represent processing operations as 'processor' classes. M

 SDP is specifically intended for the use case when you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' `"text"` or dropping utterances which are deemed to be too inaccurate for training on.

# Overview of how SDP processes a dataset
1. You call the main.py script, passing in a config.yaml file, possibly with some overrides.
2. The main.py script calls run_processors.py, passing in your config.
3. run_processors.py does the following:

   a. picks out the processors you wish to run (you can specify a subset of the processors in the config override, e.g. to avoid re-running time-consuming steps).

   b. if some of the processors have not had "output_manifest_file" or "input_manifest_file" entries specified, SDP will automatically create temporary files for those.

   c. instantiates the processor classes using `hydra.utils.instantiate`.

   d. runs the run-time processor tests by calling the `processor.test()` method.

   e. runs the processing method (`processor.process()`) of each processor in order.


# Layout of config YAML files

The YAML config file for processing a dataset must contain a key `processors`, the value of which is a list. Each item in that list is expected to be a dictionary specifying a processor class, i.e. it must have a key `_target_`, the value of which is a path to a "processor" class, and the remaining keys must be the kwargs necessary to instantiate that class with `hydra.utils.instantiate()` (cf. https://hydra.cc/docs/advanced/instantiate_objects/overview/).

SDP will run the processors specified in the `processors` list in the config file. It will also check for a `processors_to_run` key in the config file, which can be either the string "all", or any Python "slice" object.

> **Note**: SDP will run the processors in the order in which they are listed in the config YAML file. Make sure to list the processors in an order which makes sense, e.g. create an initial manifest first; make sure to run ASR inference before doing any processing which looks at `pred_text` fields in the manifest.

# Processor classes

## `Base Processor`
All processor classes inherit from the `BaseProcessor` class.
This is a very simple abstract class which has 2 empty methods: `process()` and `test()`. These serve to remind us that SDP essentially just runs `test()` on all processors, and then `process()` on all processors.

`ASRInference` is a child class of `BaseProcessor`. It has a simple `process()` method which runs transcription on every utterance in the input_manifest.

`WriteManifest` is also a child class of `BaseProcessor`. It has a simple `process()` method which saves a copy of the input manifest containing only the fields specified in `fields_to_save`.

## `BaseParallelProcessor`
`BaseParallelProcessor` inherits from the `BaseProcessor` class. Within its `.process()` method, it calls other methods and functions, which allow it to do more complex processing. Most importantly, it calls its `.process_dataset_entry(data_entry)` method on every utterance in the manifest, and it does this in parallel, allowing for more efficient processing.

### What is a `DataEntry`?
As mentioned above, `BaseParallelProcessor.process_dataset_entry(data_entry)` is called on a variable called `data_entry` which represents an utterance in our dataset. In most cases, `data_entry` is a `DataEntry` object, which represents a line in a dataset manifest.
> The only exception to the above is in processors which are run at the start of processing when we are creating a manifest for the first time (such as `CreateInitialManifestMLS`, in which the `data_entry` variable is a string containing a line for that utterance from the original raw MLS transcript).

The `DataEntry` class is a dataclass which contains 2 attributes:
1. `data` is an Optional dictionary containing items which represent the JSON manifest entry. `data` can also be `None`. If a `.process_dataset_entry(data_entry)` method returns a `DataEntry` class where `data is None`, then that utterance will be dropped from the output manifest.
2. `metrics`, which can be of any type, and are `None` by default. This variable is used by some processors to record summary statistics about the changes made to the dataset; these metrics are aggregated and can be displayed once every utterance has been processed by the processor.

### What happens in `BaseParallelProcessor.process()`

We outline the `.process()` method of the `BaseParallelProcessor` class below:

```mermaid
graph TD;
  subgraph "Steps in BaseParallelProcessor.process() method"
    A["self.prepare()
empty method by default
can be used to
e.g. download raw data automatically"] --> B - B["self.read_manifest()
reads input manifest
(i.e. from previous processing step)
if there is one"] --> C - C["self.process_dataset_entry(data_entry)
abstract method"] --> D - D["save output manifest
& aggregate metrics
(e.g. # of utts dropped)"] --> E
    E["self.finalize(metrics)
e.g. display metrics from processing"]

  end

```

## `ModifyManifestTextProcessor`
`ModifyManifestTextProcessor` inherits from the `BaseParallelProcessor` class. It takes in an additional optional parameter `test_cases` and overrides a few methods:
* `.test()`: this method makes sure that the output from the processor matches the expected output specified in the `test_cases` parameter.
* `.process_dataset_entry(data_entry)`: this method applies processing to a `data_entry`. First, spaces are added to the start and end of the 'text' and 'pred_text' entries (if they exist), then the abstract method `._process_dataset_entry(data_entry)` is called. Then, any extra spaces (e.g. two spaces next to each other '  ') are removed from 'text' and 'pred_text' entries.
* `._process_dataset_entry(data_entry)`: this is an abstract method which will be overridden by children of `ModifyManifestTextProcessor`.


## How to make your own processor classes.

We will describe how to make your own processor classes by referring to SDP's existing classes.

### Creating an initial manifest, e.g. as in `CreateInitialManifestMLS`.
`CreateInitialManifestMLS` is a child class of `BaseParallelProcessor`. It downloads raw MLS data for a specified language, and creates an initial manifest (in the format expected by NeMo) which can be cleaned by subsequent processors.

Its `.prepare()` method downloads and extracts the raw data.

Its `read_manifest()` method reads the lines in the raw MLS transcript file.

Its `process_dataset_entry()` method takes in the lines from the raw MLS transcript file, and outputs `DataEntry` objects containing entries that will be saved into the manifest (i.e. `"audio_filepath"`, `"duration"`, `"text"`) for each utterance.


### A `ModifyManifestTextProcessor` class that cleans ground truth text, e.g. as in `SubSubstringToSpace`.

One of the classes provided in SDP is `SubSubstringToSpace`. At initialization, it takes in `substrings`, a list of strings which, if found in the "text", will be converted to spaces. This is helpful for e.g. removing punctuation.

In its `_process_dataset_entry(data_entry)` method it performs the string-to-space conversion on the `data_entry` that is input. Its output is a `data_entry` with the changes applied to `data`, and the metrics of which substrings were spotted and converted to spaces recorded in `metrics`. These metrics will be aggregated over all utterances by the `BaseParallelProcessor` class. `SubSubstringToSpace` also has a `.finalize(metrics)` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

### A `ModifyManifestTextProcessor` class that drops incorrectly transcribed utterances, e.g. as in `DropHighLowCharrate`.

One of the classes provided in SDP is `DropHighLowCharrate`. At initialization, it takes in `high_charrate_threshold` and `low_charrate_threshold`; an utterance will be dropped if its character rate is above the former or below the latter. This is helpful for automatically filtering out incorrectly transcribed utterances.

In its `_process_dataset_entry(data_entry)` method it computes the character rate of the utterance. If the character rate is within bounds, it will return the same `data_entry` that was input. If the character rate is out of bounds, it will return a `data_entry` with `data=None` and `metrics` which reflect the applied changes.
Similar to the `SubSubstringToSpace` class, it has a `.finalize(metrics)` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.

## Class diagram
A diagram of the classes mentioned above is included here. Arrows represent inheritance.
```mermaid
classDiagram
BaseProcessor <|-- BaseParallelProcessor
BaseProcessor <|-- ASRInference
BaseProcessor <|-- WriteManifest

BaseParallelProcessor <|-- CreateInitialManifestMLS
BaseParallelProcessor <|-- ModifyManifestTextProcessor

ModifyManifestTextProcessor <|-- SubSubstringToSpace
ModifyManifestTextProcessor <|-- DropHighLowCharrate

class BaseProcessor{
    output_manifest_file
    input_manifest_file
    process()
    test()
}
class BaseParallelProcessor{
    process()
    prepare()
    read_manifest()
    process_dataset_entry()
    finalize()
}
class ASRInference{
    pretrained_model
    batch_size
    process()
}
class WriteManifest{
    fields_to_save
    process()
}
class CreateInitialManifestMLS{
    ...
    ....()
}
class ModifyManifestTextProcessor{
    test_cases
    test()
    process_dataset_entry(data_entry)
    _process_dataset_entry(data_entry)
}
class SubSubstringToSpace{
    substrings
    _process_dataset_entry(data_entry)
    finalize(metrics)
}
class DropHighLowCharrate{
    high_charrate_threshold
    low_charrate_threshold
    _process_dataset_entry(data_entry)
    finalize(metrics)
}
```

## Quick intro to Speech Dataset Processor

* The steps to process a dataset are specified by a YAML file
* YAML file contains a list of processor classes & the args to bass into the constructor
* Each processor class inputs an existing manifest (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio.
* To process a manifest, you need to list the chain of processors you wish to use.
* If a processor is not included, you can make your own -> see more documentation about that [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html).

## Config file layout
A simplified version of an SDP file can be:

```yaml
processors:

  # use existing classes for popular datasets or make your own class
  - _target_: sdp.processors.CreateInitialManifestMLS
    output_manifest_file: ...
    download_dir: ...
    ...

  # use existing classes for common operations or write your own
  - _target_: sdp.processors.SubSubstringToSubstring

    substring_pairs: {
      # specify the parameters needed for your usecase
      " mr ": " mister ",
      "," : " ",
      "." : " ",
    }

  - _target_: sdp.processors.DropNonAlphabet
    alphabet: " abcdefghijklmnopqrstuvwxyz"
    output_manifest_file: ...
    ...
```

## Existing processor classes
In addition to those mentioned in the example config file, many more classes are already included in Speech Dataset Processor, for example:
* `sdp.processors.ASRInference` will run inference on the manifest using a specified `pretrained_model`.
* `sdp.processors.DropHighWER` will compute WER between `text` and `pred_text` of each utterance and remove the utterance if WER is greater than the specified `wer_threshold`.
* `sdp.processors.DropHighLowCharrate` will compute the character rate in the utterance using `text` and `duration`, and drop the utterance if it is outside the bounds of the specified `high_charrate_threshold` and `low_charrate_threshold`. Carefully chosen thresholds will allow us to drop utterances with incorrect ground truth `text`.

## Processor test cases
You can add test cases to verify you have specified your desired changes correctly and to help document why you are making these changes.

For example:
```yaml
processors:
  ...
  - _target_: sdp.processors.DropIfRegexInAttribute
    attribute_to_regex:
      "text" : ["(\\D ){5,20}"] # looks for between 4 and 19 characters surrounded by spaces

    test_cases:
      - {input: {text: "some s p a c e d out letters"}, output: null}
      - {input: {text: "normal words only"}, output: {text: "normal words only"}}
      - {input: {text: "three a b c spaced out letters"}, output: {text: "three a b c spaced out letters"}}
      - {input: {text: "four a b c d spaced out letters"}, output: null}
  ...
+``` + +## Speech Dataset Processor documentation +More details about SDP can be found in the documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html). \ No newline at end of file From 0b645a13ac461af357ab0e79d6ce457c36ace48e Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Tue, 8 Nov 2022 16:43:54 -0800 Subject: [PATCH 10/16] Remove incorrect type hints Signed-off-by: Elena Rastorgueva --- .../sdp/processors/modify_manifest/modify_manifest.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py index 192c99a0191b..37ab5518a0b9 100644 --- a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py +++ b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py @@ -62,10 +62,10 @@ def test(self): ) @abstractmethod - def _process_dataset_entry(self, data_entry: Type[DataEntry]): + def _process_dataset_entry(self, data_entry): pass - def process_dataset_entry(self, data_entry: Type[DataEntry]): + def process_dataset_entry(self, data_entry): """Wrapper for 'process_dataset_entry' abstract method. Before 'process_dataset_entry' is called, the function From 57b862a1e9d30d9ce4797bffa349ad4f8a716460 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Tue, 8 Nov 2022 16:51:52 -0800 Subject: [PATCH 11/16] Make config example less confusing Signed-off-by: Elena Rastorgueva --- tools/speech_dataset_processor/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md index 777ae9ad5ead..556ee43dc20a 100644 --- a/tools/speech_dataset_processor/README.md +++ b/tools/speech_dataset_processor/README.md @@ -10,8 +10,8 @@ SDP is specifically intended for the use case when you have an existing dataset ## Quick intro to Speech Dataset Processor -* The steps to process a dataset are specified by a YAML file -* YAML file contains a list of processor classes & the args to bass into the constructor +* The steps to process a dataset are specified by a YAML file. +* YAML file contains a list of processor classes & the args to bass into the constructor. * Each processor class inputs an existing manifest (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio. * To process a manifest, you need to list the chain of processors you wish to use. * If a processor is not included, you can make your own -> see more documation about that [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html). @@ -34,8 +34,8 @@ processors: substring_pairs: { # specify the parameters needed for your usecase " mr ": " mister ", - "," : " ", - "." : " ", + " misteak ": " mistake ", + ... 
}

From 27c74588a5f1836485dd988851971cdefec8aedb Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Tue, 8 Nov 2022 16:53:25 -0800
Subject: [PATCH 12/16] Fix typo

Signed-off-by: Elena Rastorgueva

---
 tools/speech_dataset_processor/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index 556ee43dc20a..b0f1a582d435 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -11,7 +11,7 @@ SDP is specifically intended for the use case when you have an existing dataset
 ## Quick intro to Speech Dataset Processor

 * The steps to process a dataset are specified by a YAML file.
-* YAML file contains a list of processor classes & the args to bass into the constructor.
+* YAML file contains a list of processor classes & the args to pass into the constructor.
 * Each processor class inputs an existing manifest (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio.
 * To process a manifest, you need to list the chain of processors you wish to use.
 * If a processor is not included, you can make your own -> see more documentation about that [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html).

From f744e4c7491b32dd0b47acadab5b5e810b78cb3f Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Tue, 8 Nov 2022 16:56:22 -0800
Subject: [PATCH 13/16] Clarify that YAML file is config file in README

Signed-off-by: Elena Rastorgueva

---
 tools/speech_dataset_processor/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index b0f1a582d435..3028ab405f3d 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -10,13 +10,13 @@ SDP is specifically intended for the use case when you have an existing dataset
 ## Quick intro to Speech Dataset Processor

-* The steps to process a dataset are specified by a YAML file.
-* YAML file contains a list of processor classes & the args to pass into the constructor.
+* The steps to process a dataset are specified by a YAML config file.
+* The YAML config file contains a list of processor classes & the args to pass into the constructor.
 * Each processor class inputs an existing manifest (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio.
 * To process a manifest, you need to list the chain of processors you wish to use.
 * If a processor is not included, you can make your own -> see more documentation about that [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html).
-## Config file layout +## YAML config file layout A simplified version of an SDP file can be: ```yaml From 9d9132fe537b0e8f311e171d2d3e64b83f07029c Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Wed, 9 Nov 2022 10:08:37 -0800 Subject: [PATCH 14/16] Remove unused imports Signed-off-by: Elena Rastorgueva --- .../sdp/processors/modify_manifest/modify_manifest.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py index 37ab5518a0b9..5c8ceefebe8e 100644 --- a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py +++ b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py @@ -14,9 +14,9 @@ from abc import abstractmethod -from typing import Dict, List, Optional, Type +from typing import Dict, List, Optional -from sdp.processors.base_processor import BaseParallelProcessor, DataEntry +from sdp.processors.base_processor import BaseParallelProcessor from sdp.utils.edit_spaces import add_start_end_spaces, remove_extra_spaces From 6f04e61c2b137f579ee996e255a48b883481e178 Mon Sep 17 00:00:00 2001 From: Elena Rastorgueva Date: Wed, 9 Nov 2022 10:22:58 -0800 Subject: [PATCH 15/16] Remove SDP docs for now Signed-off-by: Elena Rastorgueva --- docs/source/tools/intro.rst | 1 - .../source/tools/speech_dataset_processor.rst | 128 ------------------ 2 files changed, 129 deletions(-) delete mode 100644 docs/source/tools/speech_dataset_processor.rst diff --git a/docs/source/tools/intro.rst b/docs/source/tools/intro.rst index 18a26cf03652..2a0a062040ec 100644 --- a/docs/source/tools/intro.rst +++ b/docs/source/tools/intro.rst @@ -9,6 +9,5 @@ NeMo provides a set of tools useful for developing Automatic Speech Recognitions ctc_segmentation speech_data_explorer - speech_dataset_processor diff --git a/docs/source/tools/speech_dataset_processor.rst b/docs/source/tools/speech_dataset_processor.rst deleted file mode 100644 index f60fc32b431e..000000000000 --- a/docs/source/tools/speech_dataset_processor.rst +++ /dev/null @@ -1,128 +0,0 @@ -Speech Dataset Processor -======================== - -Speech Dataset Processor (SDP) is a toolkit to make it easy to: - 1. write code to process a new dataset, minimizing the amount of boilerplate code required. - 2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file. - -SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you will need to do to process a new dataset is simply to write a YAML file containing the parameters needed to process your dataset. - -SDP is specifically intended for the use case when you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' ``"text"`` or dropping utterances which are deemed to be too inaccurate for training on. - -Overview of how SDP processes a dataset ---------------------------------------- - - 1. You call the ``main.py`` script, passing in a YAML config file, possibly with some overrides. - 2. ``main.py`` script calls ``run_processors.py``, passing in your config. - 3. ``run_processors.py`` does the following: - - a. 
-    a. picks out the processors you wish to run (you can specify a subset of the processors in the config override, e.g. to avoid re-running time-consuming steps).
-    b. if some of the processors have not had "output_manifest_file" or "input_manifest_file" entries specified, automatically creates temporary files for those.
-    c. instantiates the processor classes using ``hydra.utils.instantiate``.
-    d. runs the run-time processor tests by calling the ``processor.test()`` method.
-    e. runs the processing method (``processor.process()``) of each processor in order.
-
-
-Layout of config YAML files
----------------------------
-
-The YAML config file for processing a dataset must contain a key ``processors``, the value of which is a list. Each item in that list is expected to be a dictionary specifying a processor class, i.e. it must have a key ``_target_``, the value of which is a path to a "processor" class, and the remaining keys must be the kwargs necessary to instantiate that class with ``hydra.utils.instantiate()`` (cf. https://hydra.cc/docs/advanced/instantiate_objects/overview/).
-
-SDP will run the processors specified in the ``processors`` list in the config file. It will also check for a ``processors_to_run`` key in the config file, which can be either the string "all", or any Python "slice" object.
-
-.. note::
-    SDP will run the processors in the order in which they are listed in the config YAML file. Make sure to list the processors in an order which makes sense, e.g. create an initial manifest first; make sure to run ASR inference before doing any processing which looks at ``pred_text`` fields in the manifest.
-
-Processor classes
------------------
-
-
-``BaseProcessor``
------------------
-
-All processor classes inherit from the ``BaseProcessor`` class. This is a very simple abstract class which has 2 empty methods: ``process()`` and ``test()``.
-These serve to remind us that SDP essentially just runs ``test()`` on all processors, and then ``process()`` on all processors.
-
-``ASRInference`` is a child class of ``BaseProcessor``. It has a simple ``process()`` method which runs transcription on every utterance in the input manifest.
-
-``WriteManifest`` is also a child class of ``BaseProcessor``. It has a simple ``process()`` method which saves a copy of the input manifest containing only the fields specified in ``fields_to_save``.
-
-``BaseParallelProcessor``
--------------------------
-
-``BaseParallelProcessor`` inherits from the ``BaseProcessor`` class. Within its ``.process()`` method, it calls other methods and functions, which allow it to do more complex processing.
-Most importantly, it calls its ``.process_dataset_entry(data_entry)`` method on every utterance in the manifest, and it does this in parallel, allowing for more efficient processing.
-
-What is a ``DataEntry``?
-------------------------
-
-As mentioned above, ``BaseParallelProcessor.process_dataset_entry(data_entry)`` is called on a variable called ``data_entry`` which represents an utterance in our dataset.
-Most often, ``data_entry`` will be a dictionary containing items which represent the JSON manifest entry.
-Sometimes, such as in ``CreateInitialManifestMLS``, it will be a string containing a line for that utterance from the original raw MLS transcript.
-
-``BaseParallelProcessor.process_dataset_entry`` will process ``data_entry`` and output a ``DataEntry`` object.
-
-The ``DataEntry`` class is a dataclass which contains 2 attributes:
- 1. ``data`` is an Optional dictionary containing items which represent the JSON manifest entry. ``data`` can also be ``None``. If a ``.process_dataset_entry(data_entry)`` method returns a ``DataEntry`` object where ``data is None``, then that utterance will be dropped from the output manifest.
- 2. ``metrics``, which can be of any type and is ``None`` by default. Some processors use this attribute to record summary statistics about the changes made to the dataset; these metrics are aggregated and can be displayed once every utterance has been processed by the processor.
-
-A short code sketch showing how these attributes are used in practice is given after the diagram in the next section.
-
-What happens in ``BaseParallelProcessor.process()``
----------------------------------------------------
-
-We outline the ``.process()`` method of the ``BaseParallelProcessor`` class below:
-
-.. raw:: html
-
-    <!-- embedded diagram of the steps in BaseParallelProcessor.process() (not reproduced here) -->
-
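-
-To make the attributes of ``DataEntry`` concrete, here is a minimal sketch of a ``BaseParallelProcessor`` subclass. It is illustrative only: the class name ``DropEmptyText`` and the exact method signature are assumptions for this example, not part of SDP.
-
-.. code-block:: python
-
-    from sdp.processors.base_processor import BaseParallelProcessor, DataEntry
-
-    class DropEmptyText(BaseParallelProcessor):
-        """Hypothetical processor which drops utterances whose "text" is empty."""
-
-        def process_dataset_entry(self, data_entry):
-            # here data_entry is a dict representing one line of the JSON manifest
-            if data_entry["text"].strip() == "":
-                # data=None means this utterance is dropped from the output manifest
-                return DataEntry(data=None, metrics=1)
-            return DataEntry(data=data_entry, metrics=0)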
-
-
-``ModifyManifestTextProcessor``
--------------------------------
-
-``ModifyManifestTextProcessor`` inherits from the ``BaseParallelProcessor`` class. It takes in an additional optional parameter ``test_cases`` and overrides a few methods:
-
-* ``.test()``: this method makes sure that the output from the processor matches the expected output specified in the ``test_cases`` parameter.
-* ``.process_dataset_entry(data_entry)``: this method applies processing to a ``data_entry``. First, spaces are added to the start and end of the 'text' and 'pred_text' entries (if they exist), then the abstract method ``._process_dataset_entry(data_entry)`` is called. Then, any extra spaces (e.g. two spaces next to each other '  ') are removed from 'text' and 'pred_text' entries.
-* ``._process_dataset_entry(data_entry)``: this is an abstract method which will be overridden by children of ``ModifyManifestTextProcessor``.
-
-
-How to make your own processor classes
---------------------------------------
-
-We will describe how to make your own processor classes by referring to SDP's existing classes.
-
-Creating an initial manifest, e.g. as in ``CreateInitialManifestMLS``.
-----------------------------------------------------------------------
-
-``CreateInitialManifestMLS`` is a child class of ``BaseParallelProcessor``. It downloads raw MLS data for a specified language, and creates an initial manifest (in the format expected by NeMo) which can be cleaned by subsequent processors.
-
-Its ``.prepare()`` method downloads and extracts the raw data.
-
-Its ``read_manifest()`` method reads the lines in the raw MLS transcript file.
-
-Its ``process_dataset_entry()`` method takes in the lines from the raw MLS transcript file, and outputs ``DataEntry`` objects containing the fields that will be saved into the manifest (i.e. ``"audio_filepath"``, ``"duration"``, ``"text"``) for each utterance.
-
-
-A ``ModifyManifestTextProcessor`` class that cleans ground truth text, e.g. as in ``SubSubstringToSpace``.
-----------------------------------------------------------------------------------------------------------
-
-One of the classes provided in SDP is ``SubSubstringToSpace``. At initialization, it takes in ``substrings``, a list of strings which, if found in the "text", will be converted to spaces. This is helpful for e.g. removing punctuation.
-
-In its ``_process_dataset_entry(data_entry)`` method it performs the substring-to-space conversion on the input ``data_entry``. Its output is a ``DataEntry`` with the changes applied to ``data``, and the metrics of which substrings were spotted and converted to spaces recorded in ``metrics``. These metrics will be aggregated over all utterances by the ``BaseParallelProcessor`` class. ``SubSubstringToSpace`` also has a ``.finalize(metrics)`` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.
-
-A ``ModifyManifestTextProcessor`` class that drops incorrectly transcribed utterances, e.g. as in ``DropHighLowCharrate``.
----------------------------------------------------------------------------------------------------------------------------
-
-One of the classes provided in SDP is ``DropHighLowCharrate``. At initialization, it takes in ``high_charrate_threshold`` and ``low_charrate_threshold``: an utterance will be dropped if its character rate is above the former or below the latter. This is helpful for automatically filtering out incorrectly transcribed utterances.
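-
-For illustration, a config entry for this processor might look like the following sketch (the threshold values here are invented for the example, not recommendations):
-
-.. code-block:: yaml
-
-    processors:
-      - _target_: sdp.processors.DropHighLowCharrate
-        high_charrate_threshold: 21
-        low_charrate_threshold: 1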
-
-In its ``_process_dataset_entry(data_entry)`` method it evaluates the character rate of the utterance. If the character rate is within bounds, it will return the input ``data_entry`` unchanged. If the character rate is out of bounds, it will return a ``DataEntry`` with ``data=None`` and ``metrics`` which reflect the applied changes.
-Similar to the ``SubSubstringToSpace`` class, it has a ``.finalize(metrics)`` method which will log information about the aggregated metrics after all of the utterances in the manifest have been processed.
-
-Class diagram
--------------
-
-A diagram of the classes mentioned above is included here. Arrows represent inheritance.
-
-
-.. raw:: html
-
-    <!-- embedded class diagram (not reproduced here; see the plain-text sketch below) -->
-
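-
-Since the embedded diagram is not reproduced in this text, the inheritance relationships can be summarised in the following plain-text sketch, which is based solely on the class descriptions above:
-
-.. code-block:: text
-
-    BaseProcessor
-    ├── ASRInference
-    ├── WriteManifest
-    └── BaseParallelProcessor
-        ├── CreateInitialManifestMLS
-        └── ModifyManifestTextProcessor
-            ├── SubSubstringToSpace
-            └── DropHighLowCharrate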
From 575d51bdf8d4b341f8c7b70471c4caa3adeff7dd Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva
Date: Wed, 9 Nov 2022 10:25:04 -0800
Subject: [PATCH 16/16] Remove links to docs in SDP README

Signed-off-by: Elena Rastorgueva
---
 tools/speech_dataset_processor/README.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index 3028ab405f3d..31f22f5d81bf 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -14,7 +14,7 @@ SDP is specifically intended for the use case when you have an existing dataset
 * The YAML config file contains a list of processor classes & the args to pass into the constructor.
 * Each processor class inputs an existing manifest (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio.
 * To process a manifest, you need to list the chain of processors you wish to use.
-* If a processor is not included, you can make your own -> see more documentation about that [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html).
+* If a processor is not included, you can make your own.
 
 ## YAML config file layout
 A simplified version of an SDP file can be:
@@ -66,7 +66,4 @@ processors:
         - {input: {text: "three a b c spaced out letters"}, output: {text: "three a b c spaced out letters"}}
         - {input: {text: "four a b c d spaced out letters"}, output: null}
 ...
-```
-
-## Speech Dataset Processor documentation
-More details about SDP can be found in the documentation [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html).
\ No newline at end of file
+```
\ No newline at end of file