diff --git a/docs/source/nlp/text_normalization/intro.rst b/docs/source/nlp/text_normalization/intro.rst
index e560372f8831..1b9365728fcc 100644
--- a/docs/source/nlp/text_normalization/intro.rst
+++ b/docs/source/nlp/text_normalization/intro.rst
@@ -1,6 +1,8 @@
 (Inverse) Text Normalization
 ============================
 
+NeMo supports Text Normalization (TN) and Inverse Text Normalization (ITN) via the rule-based ``nemo_text_processing`` Python package and via neural-based TN/ITN models.
+
 Rule-based (WFST) TN/ITN:
 
 .. toctree::
@@ -9,11 +11,10 @@ Rule-based (WFST) TN/ITN:
 
    wfst/intro
 
 
-Neural TN/ITN:
+Neural-based TN/ITN:
 
 .. toctree::
    :maxdepth: 1
 
-   nn_text_normalization
-
+   neural_models
diff --git a/docs/source/nlp/text_normalization/neural_models.rst b/docs/source/nlp/text_normalization/neural_models.rst
new file mode 100644
index 000000000000..10206da067a3
--- /dev/null
+++ b/docs/source/nlp/text_normalization/neural_models.rst
@@ -0,0 +1,23 @@
+.. _neural_models:
+
+Neural Models for (Inverse) Text Normalization
+==============================================
+
+NeMo provides two types of neural TN/ITN models:
+
+
+Duplex T5-based TN/ITN:
+
+.. toctree::
+   :maxdepth: 1
+
+   nn_text_normalization
+
+
+Single-pass Tagger-based ITN:
+
+.. toctree::
+   :maxdepth: 1
+
+   text_normalization_as_tagging
+
diff --git a/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst
new file mode 100644
index 000000000000..25926bd45c69
--- /dev/null
+++ b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst
@@ -0,0 +1,165 @@
+.. _text_normalization_as_tagging:
+
+Thutmose Tagger: Single-pass Tagger-based ITN Model
+===================================================
+
+Inverse text normalization (ITN) converts text from the spoken domain (e.g., an ASR output) into its written form:
+
+Input: ``on may third we paid one hundred and twenty three dollars``
+
+Output: ``on may 3 we paid $123``
+
+`ThutmoseTaggerModel `__ is a single-pass tagger-based model that maps spoken-domain words to written-domain fragments.
+Additionally, the model predicts "semiotic" classes of the spoken words (e.g., words belonging to spans that denote times, dates, or monetary amounts).
+
+The typical workflow is to first prepare the dataset, which requires finding granular alignments between spoken-domain words and written-domain fragments.
+An example bash script for the data preparation pipeline is provided: `prepare_dataset_en.sh `__.
+Once the dataset is prepared, you can train the model. An example training script is provided: `normalization_as_tagging_train.py `__.
+The script for inference from a raw text file is provided here: `normalization_as_tagging_infer.py `__.
+An example bash script that runs inference and evaluation is provided here: `run_infer.sh `__.
+
+
+Quick Start Guide
+-----------------
+
+To run the pretrained models, see :ref:`inference_text_normalization`.
+
+Available models
+^^^^^^^^^^^^^^^^
+
+.. list-table:: *Pretrained Models*
+   :widths: 5 10
+   :header-rows: 1
+
+   * - Model
+     - Pretrained Checkpoint
+   * - itn_en_thutmose_bert
+     - https://ngc.nvidia.com/catalog/models/nvidia:nemo:itn_en_thutmose_bert
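+
+The pretrained checkpoint can also be restored programmatically. The snippet below is a minimal sketch rather than an excerpt from the NeMo examples: it assumes that ``ThutmoseTaggerModel`` is importable from ``nemo.collections.nlp.models`` and relies on the generic NeMo ``from_pretrained`` mechanism to download the checkpoint listed above.
+
+.. code::
+
+    # Minimal sketch (assumed import path): restore the pretrained
+    # Thutmose Tagger checkpoint from NGC and inspect its configuration.
+    from nemo.collections.nlp.models import ThutmoseTaggerModel
+
+    model = ThutmoseTaggerModel.from_pretrained("itn_en_thutmose_bert")
+    print(model.cfg)  # model configuration, including the underlying BERT encoder
+
+For batch inference from a text file, use the inference script described in :ref:`inference_text_normalization`.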
+
+
+Initial Data
+------------
+
+The initial data from which the dataset is prepared is the `Google text normalization dataset `__.
+It is stored in tab-separated (``.tsv``) files with three columns.
+The first column is the "semiotic class" (e.g., numbers, times, dates), the second is the token
+in written form, and the third is the spoken form. An example sentence from the dataset is shown below.
+In the example, ``<self>`` denotes that the spoken form is the same as the written form.
+
+.. code::
+
+    PLAIN   The        <self>
+    PLAIN   company    <self>
+    PLAIN   revenues   <self>
+    PLAIN   grew       <self>
+    PLAIN   four       <self>
+    PLAIN   fold       <self>
+    PLAIN   between    <self>
+    DATE    2005       two thousand five
+    PLAIN   and        <self>
+    DATE    2008       two thousand eight
+    PUNCT   .          sil
+
+
+More information about the Google Text Normalization Dataset can be found in the paper `RNN Approaches to Text Normalization: A Challenge `__ :cite:`nlp-textnorm-sproat2016rnn`.
+
+
+Data preprocessing
+------------------
+
+Our preprocessing is rather complicated, because we need to find granular alignments for semiotic spans, which are aligned only at the phrase level in the Google Text Normalization dataset.
+At the moment, data preparation scripts are provided only for English and Russian; see `prepare_dataset_en.sh `__ and `prepare_dataset_ru.sh `__.
+Data preparation includes running the GIZA++ automatic alignment tool; see `install_requirements.sh `__ for installation details.
+The purpose of the preprocessing scripts is to build the training dataset for the tagging model.
+The final dataset has a simple three-column ``.tsv`` format: 1) the input sentence, 2) the tags for the input words, and 3) the coordinates of "semiotic" spans, if any:
+
+.. code::
+
+    this plan was first enacted in nineteen eighty four and continued to be followed for nineteen years    <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19 8 4_ <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19_ <SELF>    DATE 6 9;CARDINAL 15 16
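+
+To make this format concrete, the snippet below is a small illustrative sketch (plain Python, not a utility shipped with NeMo; the function name ``parse_training_line`` is made up for illustration) that splits such a line into its three tab-separated columns and decodes the semiotic-span column:
+
+.. code::
+
+    # Illustrative sketch only: parse one line of the prepared three-column training file.
+    def parse_training_line(line):
+        sentence, tags, spans = line.rstrip("\n").split("\t")
+        words = sentence.split()
+        word_tags = tags.split()  # one replacement tag per input word
+        semiotic_spans = []
+        for span in filter(None, spans.split(";")):  # e.g. "DATE 6 9;CARDINAL 15 16"
+            cls, start, end = span.split()
+            semiotic_spans.append((cls, words[int(start):int(end)]))
+        return list(zip(words, word_tags)), semiotic_spans
+
+    pairs, spans = parse_training_line(
+        "this plan was first enacted in nineteen eighty four and continued to be followed for nineteen years"
+        "\t<SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19 8 4_ <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19_ <SELF>"
+        "\tDATE 6 9;CARDINAL 15 16"
+    )
+    print(spans)  # [('DATE', ['nineteen', 'eighty', 'four']), ('CARDINAL', ['nineteen'])]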
+
+
+Model Training
+--------------
+
+An example training script is provided: `normalization_as_tagging_train.py `__.
+The config file used by default is `thutmose_tagger_itn_config.yaml `__.
+You can change any of the parameters directly in the config file or override them with command-line arguments.
+
+Most arguments in the example config file are self-explanatory (e.g., *model.optim.lr* refers to the learning rate used for training). Most of the hyper-parameters are set to
+values that we found to be effective for the English and the Russian subsets of the Google TN dataset.
+Some arguments that you may want to modify are:
+
+- *lang*: The language of the dataset.
+
+- *data.train_ds.data_path*: The path to the training file.
+
+- *data.validation_ds.data_path*: The path to the validation file.
+
+- *model.language_model.pretrained_model_name*: The HuggingFace transformer model used to initialize the model weights.
+
+- *model.label_map*: The path to ``label_map.txt``, the dictionary of possible output tags the model may produce.
+
+- *model.semiotic_classes*: The path to ``semiotic_classes.txt``, the list of possible semiotic classes.
+
+
+Example of a training command:
+
+.. code::
+
+    python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_train.py \
+        lang=en \
+        data.validation_ds.data_path=<DATA_DIR>/valid.tsv \
+        data.train_ds.data_path=<DATA_DIR>/train.tsv \
+        model.language_model.pretrained_model_name=bert-base-uncased \
+        model.label_map=<DATA_DIR>/label_map.txt \
+        model.semiotic_classes=<DATA_DIR>/semiotic_classes.txt \
+        trainer.max_epochs=5
+
+
+.. _inference_text_normalization:
+
+Model Inference
+---------------
+
+Run inference with:
+
+.. code::
+
+    python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
+        pretrained_model=itn_en_thutmose_bert \
+        inference.from_file=./test_sent.txt \
+        inference.out_file=./output.tsv
+
+The output ``.tsv`` file consists of five columns:
+
+    * Final output text, generated from the predicted tags after simple post-processing.
+    * Input text.
+    * Sequence of predicted tags, one tag per input word.
+    * Sequence of tags after post-processing (some swaps may be applied).
+    * Sequence of predicted semiotic classes, one class per input word.
+
+
+Model Architecture
+------------------
+
+The model first uses a Transformer encoder (e.g., bert-base-uncased) to build a
+contextualized representation for each input token. It then uses one classification head
+to predict a tag for each token and another classification head to predict a "semiotic" class label for each token.
+
+Overall, our design is partly inspired by the LaserTagger approach proposed in the paper
+`Encode, tag, realize: High-precision text editing `__ :cite:`nlp-textnorm-malmi2019encode`.
+
+The LaserTagger method is not directly applicable to ITN, because it treats an entire non-common fragment as a single
+replacement tag, whereas spoken-to-written conversion, e.g. of a date, needs to be aligned at a more granular level. Otherwise,
+the tag vocabulary would have to include all possible numbers, dates, etc., which is infeasible. For example, given the pair "over
+four hundred thousand fish" / "over 400,000 fish", LaserTagger would need the single replacement "400,000" in its tag vocabulary.
+To overcome this problem, we use a different method of collecting the vocabulary of replacement tags, based on automatic alignment of spoken-domain words to small fragments of
+written-domain text, along with ``<SELF>`` and ``<DELETE>`` tags.
+
+
+References
+----------
+
+.. bibliography:: tn_itn_all.bib
+   :style: plain
+   :labelprefix: NLP-TEXTNORM
+   :keyprefix: nlp-textnorm-
diff --git a/docs/source/nlp/text_normalization/tn_itn_all.bib b/docs/source/nlp/text_normalization/tn_itn_all.bib
index 42f9a090021f..6fc843110e16 100644
--- a/docs/source/nlp/text_normalization/tn_itn_all.bib
+++ b/docs/source/nlp/text_normalization/tn_itn_all.bib
@@ -87,4 +87,11 @@ @inproceedings{koehn-etal-2007-moses
     publisher = "Association for Computational Linguistics",
     url = "https://aclanthology.org/P07-2045",
     pages = "177--180",
-}
\ No newline at end of file
+}
+
+@article{malmi2019encode,
+  title={Encode, tag, realize: High-precision text editing},
+  author={Malmi, Eric and Krause, Sebastian and Rothe, Sascha and Mirylenka, Daniil and Severyn, Aliaksei},
+  journal={arXiv preprint arXiv:1909.01187},
+  year={2019}
+}
diff --git a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
index 3ee9d319515f..d05cabd36a4e 100644
--- a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
+++ b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
@@ -113,6 +113,7 @@
    },
    "outputs": [],
    "source": [
+    "!rm -r en_data_small\n",
     "!wget \"https://multilangaudiosamples.s3.us-east-2.amazonaws.com/en_data_small.zip\"\n",
     "!unzip en_data_small"
   ]