diff --git a/docs/source/nlp/text_normalization/intro.rst b/docs/source/nlp/text_normalization/intro.rst
index e560372f8831..1b9365728fcc 100644
--- a/docs/source/nlp/text_normalization/intro.rst
+++ b/docs/source/nlp/text_normalization/intro.rst
@@ -1,6 +1,8 @@
(Inverse) Text Normalization
============================
+NeMo supports Text Normalization (TN) and Inverse Text Normalization (ITN) tasks via the rule-based `nemo_text_processing` Python package and via neural-based TN/ITN models.
+
Rule-based (WFST) TN/ITN:
.. toctree::
@@ -9,11 +11,10 @@ Rule-based (WFST) TN/ITN:
wfst/intro
-Neural TN/ITN:
+Neural-based TN/ITN:
.. toctree::
:maxdepth: 1
- nn_text_normalization
-
+ neural_models
diff --git a/docs/source/nlp/text_normalization/neural_models.rst b/docs/source/nlp/text_normalization/neural_models.rst
new file mode 100644
index 000000000000..10206da067a3
--- /dev/null
+++ b/docs/source/nlp/text_normalization/neural_models.rst
@@ -0,0 +1,23 @@
+.. _neural_models:
+
+Neural Models for (Inverse) Text Normalization
+==============================================
+
+NeMo provides two types of neural models:
+
+
+Duplex T5-based TN/ITN:
+
+.. toctree::
+ :maxdepth: 1
+
+ nn_text_normalization
+
+
+Single-pass Tagger-based ITN:
+
+.. toctree::
+ :maxdepth: 1
+
+ text_normalization_as_tagging
+
diff --git a/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst
new file mode 100644
index 000000000000..25926bd45c69
--- /dev/null
+++ b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst
@@ -0,0 +1,165 @@
+.. _text_normalization_as_tagging:
+
+Thutmose Tagger: Single-pass Tagger-based ITN Model
+===================================================
+Inverse text normalization (ITN) converts text from the spoken domain (e.g., an ASR output) into its written form:
+
+Input: ``on may third we paid one hundred and twenty three dollars``
+
+Output: ``on may 3 we paid $123``
+
+`ThutmoseTaggerModel `__ is a single-pass tagger-based model that maps spoken-domain words to written-domain fragments.
+Additionally, this model predicts the "semiotic" class of each spoken word (e.g., whether it belongs to a span describing a time, date, or monetary amount).
+
+The typical workflow is to first prepare the dataset, which requires finding granular alignments between spoken-domain words and written-domain fragments.
+An example bash script for the data preparation pipeline is provided: `prepare_dataset_en.sh `__.
+Once the dataset is ready, you can train the model. An example training script is provided: `normalization_as_tagging_train.py `__.
+The script for inference from a raw text file is provided here: `normalization_as_tagging_infer.py `__.
+An example bash script that runs inference and evaluation is provided here: `run_infer.sh `__.
+
+
+Quick Start Guide
+-----------------
+
+To run the pretrained models, see :ref:`inference_text_normalization`.
+
+Available models
+^^^^^^^^^^^^^^^^
+
+.. list-table:: *Pretrained Models*
+ :widths: 5 10
+ :header-rows: 1
+
+ * - Model
+ - Pretrained Checkpoint
+ * - itn_en_thutmose_bert
+ - https://ngc.nvidia.com/catalog/models/nvidia:nemo:itn_en_thutmose_bert
+
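+The same checkpoint can also be loaded programmatically. Below is a minimal sketch, not an official snippet: it assumes the class is exposed as ``nemo.collections.nlp.models.ThutmoseTaggerModel`` (the exact import path may differ between NeMo versions) and relies only on the generic NeMo ``list_available_models``/``from_pretrained`` interface.
+
+.. code::
+
+    # Minimal sketch (assumed import path): load the pretrained Thutmose tagger from NGC.
+    from nemo.collections.nlp.models import ThutmoseTaggerModel
+
+    # List the checkpoints published for this model class.
+    print(ThutmoseTaggerModel.list_available_models())
+
+    # Download and restore the English ITN checkpoint from the table above.
+    model = ThutmoseTaggerModel.from_pretrained("itn_en_thutmose_bert")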
+
+Initial Data
+------------
+The initial data from which the dataset is prepared is the `Google Text Normalization dataset `__.
+It is stored in TAB separated files (``.tsv``) with three columns.
+The first column is the "semiotic class" (e.g., numbers, times, dates), the second is the token
+in written form, and the third is the spoken form. An example sentence in the dataset is shown below.
+In the example, ``<self>`` denotes that the spoken form is the same as the written form, and ``sil`` marks tokens (e.g., punctuation) that are not verbalized.
+
+.. code::
+
+    PLAIN   The        <self>
+    PLAIN   company    <self>
+    PLAIN   revenues   <self>
+    PLAIN   grew       <self>
+    PLAIN   four       <self>
+    PLAIN   fold       <self>
+    PLAIN   between    <self>
+    DATE    2005       two thousand five
+    PLAIN   and        <self>
+    DATE    2008       two thousand eight
+    PUNCT   .          sil
+
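+A small illustrative sketch (plain Python; ``train.tsv`` is a placeholder file name) shows how rows in this three-column format can be grouped back into sentences. It assumes sentence boundaries are marked with ``<eos>`` rows, as in the original Google release.
+
+.. code::
+
+    # Illustrative sketch: group the three-column rows of the Google TN data into sentences.
+    spoken, written = [], []
+    with open("train.tsv", encoding="utf-8") as f:
+        for line in f:
+            parts = line.rstrip("\n").split("\t")
+            if parts[0] == "<eos>":  # sentence boundary in the original data release
+                print(" ".join(written), "->", " ".join(spoken))
+                spoken, written = [], []
+                continue
+            semiotic_class, token, verbalization = parts
+            written.append(token)
+            if verbalization == "<self>":
+                spoken.append(token)          # spoken form equals the written form
+            elif verbalization != "sil":
+                spoken.append(verbalization)  # use the verbalized (spoken) form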
+
+
+More information about the Google Text Normalization Dataset can be found in the paper `RNN Approaches to Text Normalization: A Challenge `__ :cite:`nlp-textnorm-sproat2016rnn`.
+
+
+Data preprocessing
+------------------
+
+Our preprocessing is rather complicated, because we need to find granular alignments for semiotic spans, which are aligned only at the phrase level in the Google Text Normalization dataset.
+Right now we only provide data preparation scripts for the English and Russian languages; see `prepare_dataset_en.sh `__ and `prepare_dataset_ru.sh `__.
+Data preparation includes running the GIZA++ automatic alignment tool; see `install_requirements.sh `__ for installation details.
+The purpose of the preprocessing scripts is to build the training dataset for the tagging model.
+The final dataset has a simple 3-column tsv format: 1) the input sentence, 2) the tags for the input words, and 3) the coordinates of "semiotic" spans, if any.
+
+.. code::
+
+    this plan was first enacted in nineteen eighty four and continued to be followed for nineteen years    <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19 8 4_ <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19_ <SELF>    DATE 6 9;CARDINAL 15 16
+
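+For orientation, the sketch below (plain Python; ``final_train.tsv`` is a placeholder file name) shows how the three tab-separated columns of such a line relate to each other.
+
+.. code::
+
+    # Illustrative sketch: pair each input word with its target tag and list the semiotic spans.
+    with open("final_train.tsv", encoding="utf-8") as f:
+        for line in f:
+            words_column, tags_column, spans_column = line.rstrip("\n").split("\t")
+            words = words_column.split()
+            tags = tags_column.split()
+            assert len(words) == len(tags), "one tag is expected per input word"
+            for word, tag in zip(words, tags):
+                print(f"{word}\t{tag}")
+            # Each span looks like "CLASS start end"; multiple spans are separated by ";".
+            for span in filter(None, spans_column.split(";")):
+                print("semiotic span:", span)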
+
+Model Training
+--------------
+
+An example training script is provided: `normalization_as_tagging_train.py `__.
+The config file used by default is `thutmose_tagger_itn_config.yaml `__.
+You can change any of the parameters directly from the config file or update them with the command-line arguments.
+
+Most arguments in the example config file are quite self-explanatory (e.g., *model.optim.lr* refers to the learning rate used during training). We have set most of the hyper-parameters to
+values that we found to be effective (for the English and the Russian subsets of the Google TN dataset).
+Some arguments that you may want to modify are:
+
+- *lang*: The language of the dataset.
+
+- *data.train_ds.data_path*: The path to the training file.
+
+- *data.validation_ds.data_path*: The path to the validation file.
+
+- *model.language_model.pretrained_model_name*: The HuggingFace transformer model used to initialize the model weights.
+
+- *model.label_map*: The path to the ``label_map.txt`` file, i.e., the dictionary of possible output tags that the model may produce.
+
+- *model.semiotic_classes*: The path to the ``semiotic_classes.txt`` file, i.e., the list of possible semiotic classes.
+
+
+Example of a training command:
+
+.. code::
+
+ python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_train.py \
+ lang=en \
+ data.validation_ds.data_path=/valid.tsv \
+ data.train_ds.data_path=/train.tsv \
+ model.language_model.pretrained_model_name=bert-base-uncased \
+ model.label_map=/label_map.txt \
+ model.semiotic_classes=/semiotic_classes.txt \
+ trainer.max_epochs=5
+
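+Training typically also produces a ``.nemo`` checkpoint that can be restored directly in Python. A minimal sketch is shown below; the checkpoint path is a placeholder, and the import path is assumed to be ``nemo.collections.nlp.models``.
+
+.. code::
+
+    # Sketch: restore a locally trained checkpoint (placeholder path) instead of the NGC model.
+    from nemo.collections.nlp.models import ThutmoseTaggerModel
+
+    model = ThutmoseTaggerModel.restore_from("/path/to/checkpoints/thutmose_tagger.nemo")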
+
+
+.. _inference_text_normalization:
+
+Model Inference
+---------------
+
+Run the inference:
+
+.. code::
+
+ python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
+ pretrained_model=itn_en_thutmose_bert \
+ inference.from_file=./test_sent.txt \
+ inference.out_file=./output.tsv
+
+The output tsv file consists of 5 columns:
+
+ * Final output text - it is generated from predicted tags after some simple post-processing.
+ * Input text.
+ * Sequence of predicted tags - one tag for each input word.
+ * Sequence of tags after post-processing (some swaps may be applied).
+ * Sequence of predicted semiotic classes - one class for each input word.
+
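+If you need to post-process the predictions programmatically, the final text is the first of these columns. A small sketch (``output.tsv`` matches the ``inference.out_file`` argument above):
+
+.. code::
+
+    # Sketch: read the inference output and keep only the final normalized text (column 1 of 5).
+    with open("output.tsv", encoding="utf-8") as f:
+        for line in f:
+            final_text, input_text, tags, tags_post, semiotic = line.rstrip("\n").split("\t")
+            print(final_text)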
+
+Model Architecture
+------------------
+
+The model first uses a Transformer encoder (e.g., bert-base-uncased) to build a
+contextualized representation for each input token. It then uses a classification head
+to predict the tag for each token. Another classification head is used to predict a "semiotic" class label for each token.
+
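+As a rough illustration only (not the actual NeMo implementation), the overall shape of the model can be sketched with PyTorch and HuggingFace ``transformers`` as a pretrained encoder with two token-level classification heads. The head sizes below are made-up placeholders; in NeMo they come from ``label_map.txt`` and ``semiotic_classes.txt``.
+
+.. code::
+
+    # Rough architecture sketch: pretrained encoder + two token-level classification heads.
+    import torch
+    from transformers import AutoModel, AutoTokenizer
+
+    class TaggerSketch(torch.nn.Module):
+        def __init__(self, model_name="bert-base-uncased", num_tags=1000, num_semiotic_classes=20):
+            super().__init__()
+            self.encoder = AutoModel.from_pretrained(model_name)
+            hidden_size = self.encoder.config.hidden_size
+            self.tag_head = torch.nn.Linear(hidden_size, num_tags)                      # one tag per token
+            self.semiotic_head = torch.nn.Linear(hidden_size, num_semiotic_classes)     # one semiotic class per token
+
+        def forward(self, input_ids, attention_mask):
+            hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
+            return self.tag_head(hidden), self.semiotic_head(hidden)
+
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+    batch = tokenizer(["on may third we paid one hundred and twenty three dollars"], return_tensors="pt")
+    tag_logits, semiotic_logits = TaggerSketch()(batch["input_ids"], batch["attention_mask"])
+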
+Overall, our design is partly inspired by the LaserTagger approach proposed in the paper
+`Encode, tag, realize: High-precision text editing `__ :cite:`nlp-textnorm-malmi2019encode`.
+
+The LaserTagger method is not directly applicable to ITN because it can only treat a whole non-common fragment as a single
+replacement tag, whereas a spoken-to-written conversion, e.g. of a date, needs to be aligned at a more granular level. Otherwise,
+the tag vocabulary would have to include all possible numbers, dates, etc., which is infeasible. For example, given the pair "over
+four hundred thousand fish" - "over 400,000 fish", LaserTagger would need a single replacement "400,000" in the tag vocabulary.
+To overcome this problem, we collect the vocabulary of replacement tags differently, based on automatic alignment of spoken-domain words to small fragments of
+written-domain text, along with ``<SELF>`` and ``<DELETE>`` tags.
+
+
+References
+----------
+
+.. bibliography:: tn_itn_all.bib
+ :style: plain
+ :labelprefix: NLP-TEXTNORM
+ :keyprefix: nlp-textnorm-
diff --git a/docs/source/nlp/text_normalization/tn_itn_all.bib b/docs/source/nlp/text_normalization/tn_itn_all.bib
index 42f9a090021f..6fc843110e16 100644
--- a/docs/source/nlp/text_normalization/tn_itn_all.bib
+++ b/docs/source/nlp/text_normalization/tn_itn_all.bib
@@ -87,4 +87,11 @@ @inproceedings{koehn-etal-2007-moses
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P07-2045",
pages = "177--180",
-}
\ No newline at end of file
+}
+
+@article{malmi2019encode,
+ title={Encode, tag, realize: High-precision text editing},
+ author={Malmi, Eric and Krause, Sebastian and Rothe, Sascha and Mirylenka, Daniil and Severyn, Aliaksei},
+ journal={arXiv preprint arXiv:1909.01187},
+ year={2019}
+}
diff --git a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
index 3ee9d319515f..d05cabd36a4e 100644
--- a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
+++ b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb
@@ -113,6 +113,7 @@
},
"outputs": [],
"source": [
+    "!rm -r en_data_small\n",
"!wget \"https://multilangaudiosamples.s3.us-east-2.amazonaws.com/en_data_small.zip\"\n",
"!unzip en_data_small"
]