diff --git a/tutorials/llm/llama-3/README.rst b/tutorials/llm/llama-3/README.rst
index 663c0c99abfc..be232edd1df0 100755
--- a/tutorials/llm/llama-3/README.rst
+++ b/tutorials/llm/llama-3/README.rst
@@ -1,174 +1,19 @@
-Llama 3 LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM
-==========================================================================
-`Llama 3 `_ is an open-source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens, and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants—base pretrained and instruction tuned.
+Getting Started with Llama 3 and Llama 3.1
+==========================================
-`Low-Rank Adaptation (LoRA) `__ has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.
+This directory contains Jupyter notebook tutorials that use the NeMo Framework with Meta's Llama 3 and Llama 3.1 models.
-`NVIDIA NeMo
-Framework `__ provides tools to perform LoRA on Llama 3 to fit your use case, which can then be deployed using `NVIDIA NIM `__ for optimized inference on NVIDIA GPUs.
+.. list-table::
+ :widths: 100 25 100
+ :header-rows: 1
-.. figure:: ./img/e2e-lora-train-and-deploy.png
- :width: 1000
- :alt: Diagram showing the steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM. The steps include converting the base model to .nemo format, creating LoRA adapters with NeMo, and then depoying the LoRA adapter with NIM for inference.
- :align: center
-
- Figure 1: Steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM
-
-
-| NIM enables seamless deployment of multiple LoRA adapters (referred to as “multi-LoRA”) on the same base model. It dynamically loads the adapter weights based on incoming requests at runtime. This flexibility allows handling inputs from various tasks or use cases without deploying a unique model for each individual scenario. For further details, consult the `NIM documentation for LLMs `__.
-
-Requirements
--------------
-
-* System Configuration
- * Access to at least 1 NVIDIA GPU with a cumulative memory of at least 80GB, for example: 1 x H100-80GB or 1 x A100-80GB.
- * A Docker-enabled environment, with `NVIDIA Container Runtime `_ installed, which will make the container GPU-aware.
- * `Additional NIM requirements `_.
-
-* `Authenticate with NVIDIA NGC `_, and download `NGC CLI Tool `_. You will use this tool to download the model and customize it with NeMo Framework.
-
-
-`Create a LoRA Adapter with NeMo Framework <./llama3-lora-nemofw.ipynb>`__
---------------------------------------------------------------------------
-
-This notebook shows how to perform LoRA PEFT on **Llama 3 8B Instruct** using `PubMedQA `__ with NeMo Framework. PubMedQA is a Question-Answering dataset for biomedical texts. You will use the NeMo Framework which is available as a `docker container `__.
-
-1. Download the `Llama 3 8B Instruct .nemo `__ from NVIDIA NGC using the NGC CLI. The following command saves the ``.nemo`` format model in a folder named ``llama-3-8b-instruct-nemo_v1.0`` in the current directory. You can specify another path using the ``-d`` option in the CLI tool.
-
-.. code:: bash
-
- ngc registry model download-version "nvidia/nemo/llama-3-8b-instruct-nemo:1.0"
-
-
-Alternatively, you can download the model from `Hugging Face `__ and convert it to the ``.nemo`` format using the Hugging Face to NeMo `Llama checkpoint conversion script `__. If you'd like to skip this extra step, the ``.nemo`` model is available on NGC as mentioned above.
-
-2. Run the container using the following command. It is assumed that you have the notebook(s) and llama-3-8b-instruct model available in the current directory. If not, mount the appropriate folder to ``/workspace``.
-
-.. code:: bash
-
- export FW_VERSION=24.05 # Make sure to choose the latest available tag
-
-
-.. code:: bash
-
- docker run \
- --gpus all \
- --shm-size=2g \
- --net=host \
- --ulimit memlock=-1 \
- --rm -it \
- -v ${PWD}:/workspace \
- -w /workspace \
- -v ${PWD}/results:/results \
- nvcr.io/nvidia/nemo:$FW_VERSION bash
-
-3. From within the container, start the Jupyter lab:
-
-.. code:: bash
-
- jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
-
-4. Then, navigate to `this notebook <./llama3-lora-nemofw.ipynb>`__.
-
-
-`Deploy Multiple LoRA Inference Adapters with NVIDIA NIM <./llama3-lora-deploy-nim.ipynb>`__
---------------------------------------------------------------------------------------------
-
-This procedure demonstrates how to deploy multiple LoRA adapters with NVIDIA NIM. NIM supports LoRA adapters in ``.nemo`` (from NeMo Framework), and Hugging Face model formats. You will deploy the PubMedQA LoRA adapter from the first notebook, alongside two previously trained LoRA adapters (`GSM8K `__, `SQuAD `__) that are available on NVIDIA NGC as examples.
-
-``NOTE``: Although it’s not mandatory to finish the LoRA training and secure the adapter from the preceding notebook (“Creating a LoRA adapter with NeMo Framework”) to proceed with this one, it is advisable. Regardless, you can continue to learn about LoRA deployment with NIM using other adapters that you’ve downloaded from NVIDIA NGC.
-
-
-1. Download the example LoRA adapters.
-
-The following steps assume that you have authenticated with NGC and downloaded the CLI tool, as listed in the Requirements section.
-
-.. code:: bash
-
- # Set path to your LoRA model store
- export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
-
-
-.. code:: bash
-
- mkdir -p $LOCAL_PEFT_DIRECTORY
- pushd $LOCAL_PEFT_DIRECTORY
-
- # downloading NeMo-format loras
- ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
- ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
-
- popd
- chmod -R 777 $LOCAL_PEFT_DIRECTORY
-
-2. Prepare the LoRA model store.
-
-After training is complete, that LoRA model checkpoint will be created at ``./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo``, assuming default paths in the first notebook weren’t modified.
-
-To ensure the model store is organized as expected, create a folder named ``llama3-8b-pubmed-qa``, and move your ``.nemo`` checkpoint there.
-
-.. code:: bash
-
- mkdir -p $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa
-
- # Ensure the source path is correct
- cp ./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa
-
-
-
-Ensure that the LoRA model store directory follows this structure: the model name(s) should be sub-folder(s) containing the ``.nemo`` file(s).
-
-::
-
- <$LOCAL_PEFT_DIRECTORY>
- ├── llama3-8b-instruct-lora_vnemo-math-v1
- │ └── llama3_8b_math.nemo
- ├── llama3-8b-instruct-lora_vnemo-squad-v1
- │ └── llama3_8b_squad.nemo
- └── llama3-8b-pubmed-qa
- └── megatron_gpt_peft_lora_tuning.nemo
-
-The last one was just trained on the PubmedQA dataset in the previous notebook.
-
-
-3. Set-up NIM.
-
-From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:
-
-.. code:: bash
-
- # Set these configurations
- export NGC_API_KEY=
- export NIM_PEFT_REFRESH_INTERVAL=3600 # (in seconds) will check NIM_PEFT_SOURCE for newly added models in this interval
- export NIM_CACHE_PATH= # Model artifacts (in container) are cached in this directory
-
-
-.. code:: bash
-
- mkdir -p $NIM_CACHE_PATH
- chmod -R 777 $NIM_CACHE_PATH
-
- export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
- export CONTAINER_NAME=meta-llama3-8b-instruct
-
- docker run -it --rm --name=$CONTAINER_NAME \
- --runtime=nvidia \
- --gpus all \
- --shm-size=16GB \
- -e NGC_API_KEY \
- -e NIM_PEFT_SOURCE \
- -e NIM_PEFT_REFRESH_INTERVAL \
- -v $NIM_CACHE_PATH:/opt/nim/.cache \
- -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
- -p 8000:8000 \
- nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
-
-The first time you run the command, it will download the model and cache it in ``$NIM_CACHE_PATH`` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the `NIM configuration `__ documentation.
-
-
-4. Start the notebook.
-
-From another terminal, follow the same instructions as the previous notebook to launch Jupyter Lab, and then navigate to `this notebook <./llama3-lora-deploy-nim.ipynb>`__.
-
-You can use the same NeMo Framework docker container which has Jupyter Lab already installed.
\ No newline at end of file
+ * - Tutorial
+ - Dataset
+ - Description
+ * - `Llama 3 LoRA Fine-Tuning and Multi-LoRA Deployment with NeMo Framework and NVIDIA NIM <./biomedical-qa>`_
+ - `PubMedQA `_
+     - Perform LoRA PEFT on Llama 3 8B Instruct using a biomedical question-answering dataset, then deploy multiple LoRA adapters with NVIDIA NIM.
+ * - `Llama 3.1 Law-Domain LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM <./sdg-law-title-generation>`_
+ - `Law StackExchange `_
+     - Perform LoRA PEFT on Llama 3.1 8B Instruct using a synthetically augmented version of Law StackExchange with NeMo Framework, followed by deployment with NVIDIA NIM. As a prerequisite, follow the tutorial for `data curation using NeMo Curator `__.
diff --git a/tutorials/llm/llama-3/biomedical-qa/README.rst b/tutorials/llm/llama-3/biomedical-qa/README.rst
new file mode 100755
index 000000000000..663c0c99abfc
--- /dev/null
+++ b/tutorials/llm/llama-3/biomedical-qa/README.rst
@@ -0,0 +1,174 @@
+Llama 3 LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM
+==========================================================================
+
+`Llama 3 `_ is an open-source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens, and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants—base pretrained and instruction tuned.
+
+`Low-Rank Adaptation (LoRA) `__ has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.
+
+`NVIDIA NeMo
+Framework `__ provides tools to perform LoRA on Llama 3 to fit your use case, which can then be deployed using `NVIDIA NIM `__ for optimized inference on NVIDIA GPUs.
+
+.. figure:: ./img/e2e-lora-train-and-deploy.png
+ :width: 1000
+   :alt: Diagram showing the steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM. The steps include converting the base model to .nemo format, creating LoRA adapters with NeMo, and then deploying the LoRA adapter with NIM for inference.
+ :align: center
+
+ Figure 1: Steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM
+
+
+| NIM enables seamless deployment of multiple LoRA adapters (referred to as “multi-LoRA”) on the same base model. It dynamically loads the adapter weights based on incoming requests at runtime. This flexibility allows handling inputs from various tasks or use cases without deploying a unique model for each individual scenario. For further details, consult the `NIM documentation for LLMs `__.
+
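+For illustration, here is a minimal sketch of what such a request looks like once a NIM server is running (the deployment section below sets this up). The ``model`` field of an OpenAI-style completions request selects which LoRA adapter NIM should load and apply; the adapter name ``llama3-8b-pubmed-qa`` and the prompt are placeholders for the adapter you will train in the first notebook.
+
+.. code:: bash
+
+    # Illustrative request against a running NIM server on port 8000 (see the deployment section below).
+    # "model" selects the LoRA adapter; use the base model name (e.g. "meta/llama3-8b-instruct") to skip LoRA.
+    curl -s http://0.0.0.0:8000/v1/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+            "model": "llama3-8b-pubmed-qa",
+            "prompt": "QUESTION: Is aspirin effective for headache? ANSWER:",
+            "max_tokens": 32
+          }'
+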
+Requirements
+-------------
+
+* System Configuration
+ * Access to at least 1 NVIDIA GPU with a cumulative memory of at least 80GB, for example: 1 x H100-80GB or 1 x A100-80GB.
+ * A Docker-enabled environment, with `NVIDIA Container Runtime `_ installed, which will make the container GPU-aware.
+ * `Additional NIM requirements `_.
+
+* `Authenticate with NVIDIA NGC `_, and download `NGC CLI Tool `_. You will use this tool to download the model and customize it with NeMo Framework.
+
+
+`Create a LoRA Adapter with NeMo Framework <./llama3-lora-nemofw.ipynb>`__
+--------------------------------------------------------------------------
+
+This notebook shows how to perform LoRA PEFT on **Llama 3 8B Instruct** using `PubMedQA `__ with NeMo Framework. PubMedQA is a question-answering dataset for biomedical texts. You will use the NeMo Framework, which is available as a `Docker container `__.
+
+1. Download the `Llama 3 8B Instruct .nemo `__ from NVIDIA NGC using the NGC CLI. The following command saves the ``.nemo`` format model in a folder named ``llama-3-8b-instruct-nemo_v1.0`` in the current directory. You can specify another path using the ``-d`` option in the CLI tool.
+
+.. code:: bash
+
+ ngc registry model download-version "nvidia/nemo/llama-3-8b-instruct-nemo:1.0"
+
+
+Alternatively, you can download the model from `Hugging Face `__ and convert it to the ``.nemo`` format using the Hugging Face to NeMo `Llama checkpoint conversion script `__. If you'd like to skip this extra step, the ``.nemo`` model is available on NGC as mentioned above.
+
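+The sketch below shows roughly what that conversion looks like, assuming the Hugging Face checkpoint has already been downloaded locally and that you are running inside the NeMo Framework container. The script location and flag names can vary between NeMo versions, so check the script's ``--help`` before running it.
+
+.. code:: bash
+
+    # Hypothetical conversion of a local Hugging Face checkpoint to the .nemo format.
+    # Verify the script path and arguments against your NeMo version.
+    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
+        --input_name_or_path ./Meta-Llama-3-8B-Instruct \
+        --output_path ./llama-3-8b-instruct.nemo
+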
+2. Run the container using the following command. It is assumed that you have the notebook(s) and llama-3-8b-instruct model available in the current directory. If not, mount the appropriate folder to ``/workspace``.
+
+.. code:: bash
+
+ export FW_VERSION=24.05 # Make sure to choose the latest available tag
+
+
+.. code:: bash
+
+ docker run \
+ --gpus all \
+ --shm-size=2g \
+ --net=host \
+ --ulimit memlock=-1 \
+ --rm -it \
+ -v ${PWD}:/workspace \
+ -w /workspace \
+ -v ${PWD}/results:/results \
+ nvcr.io/nvidia/nemo:$FW_VERSION bash
+
+3. From within the container, start Jupyter Lab:
+
+.. code:: bash
+
+ jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
+
+4. Then, navigate to `this notebook <./llama3-lora-nemofw.ipynb>`__.
+
+
+`Deploy Multiple LoRA Inference Adapters with NVIDIA NIM <./llama3-lora-deploy-nim.ipynb>`__
+--------------------------------------------------------------------------------------------
+
+This procedure demonstrates how to deploy multiple LoRA adapters with NVIDIA NIM. NIM supports LoRA adapters in the ``.nemo`` (from NeMo Framework) and Hugging Face model formats. You will deploy the PubMedQA LoRA adapter from the first notebook, alongside two previously trained LoRA adapters (`GSM8K `__, `SQuAD `__) that are available on NVIDIA NGC as examples.
+
+``NOTE``: Completing the LoRA training and obtaining the adapter from the preceding notebook (“Create a LoRA Adapter with NeMo Framework”) is advisable before proceeding with this one, but not mandatory. Either way, you can learn about LoRA deployment with NIM using the adapters you’ve downloaded from NVIDIA NGC.
+
+
+1. Download the example LoRA adapters.
+
+The following steps assume that you have authenticated with NGC and downloaded the CLI tool, as listed in the Requirements section.
+
+.. code:: bash
+
+ # Set path to your LoRA model store
+ export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
+
+
+.. code:: bash
+
+ mkdir -p $LOCAL_PEFT_DIRECTORY
+ pushd $LOCAL_PEFT_DIRECTORY
+
+ # downloading NeMo-format loras
+ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
+ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
+
+ popd
+ chmod -R 777 $LOCAL_PEFT_DIRECTORY
+
+2. Prepare the LoRA model store.
+
+After training is complete, the LoRA model checkpoint will be created at ``./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo``, assuming default paths in the first notebook weren’t modified.
+
+To ensure the model store is organized as expected, create a folder named ``llama3-8b-pubmed-qa``, and move your ``.nemo`` checkpoint there.
+
+.. code:: bash
+
+ mkdir -p $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa
+
+ # Ensure the source path is correct
+ cp ./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa
+
+
+
+Ensure that the LoRA model store directory follows this structure: each model name is a sub-folder that contains the corresponding ``.nemo`` file.
+
+::
+
+ <$LOCAL_PEFT_DIRECTORY>
+ ├── llama3-8b-instruct-lora_vnemo-math-v1
+ │ └── llama3_8b_math.nemo
+ ├── llama3-8b-instruct-lora_vnemo-squad-v1
+ │ └── llama3_8b_squad.nemo
+ └── llama3-8b-pubmed-qa
+ └── megatron_gpt_peft_lora_tuning.nemo
+
+The last one is the adapter you just trained on the PubMedQA dataset in the previous notebook.
+
+
+3. Set up NIM.
+
+From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:
+
+.. code:: bash
+
+ # Set these configurations
+ export NGC_API_KEY=
+ export NIM_PEFT_REFRESH_INTERVAL=3600 # (in seconds) will check NIM_PEFT_SOURCE for newly added models in this interval
+ export NIM_CACHE_PATH= # Model artifacts (in container) are cached in this directory
+
+
+.. code:: bash
+
+ mkdir -p $NIM_CACHE_PATH
+ chmod -R 777 $NIM_CACHE_PATH
+
+ export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
+ export CONTAINER_NAME=meta-llama3-8b-instruct
+
+ docker run -it --rm --name=$CONTAINER_NAME \
+ --runtime=nvidia \
+ --gpus all \
+ --shm-size=16GB \
+ -e NGC_API_KEY \
+ -e NIM_PEFT_SOURCE \
+ -e NIM_PEFT_REFRESH_INTERVAL \
+ -v $NIM_CACHE_PATH:/opt/nim/.cache \
+ -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
+ -p 8000:8000 \
+ nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
+
+The first time you run the command, it will download the model and cache it in ``$NIM_CACHE_PATH`` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the `NIM configuration `__ documentation.
+
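+Once the container reports that it is ready, you can sanity-check the deployment from the host. The snippet below is a minimal check, assuming the container is serving on port 8000 as configured above: it lists the models NIM can serve, which should include the base model and each LoRA adapter found in ``$NIM_PEFT_SOURCE``.
+
+.. code:: bash
+
+    # List the deployed base model and LoRA adapters served by NIM
+    # (pipe through json.tool only to pretty-print the response).
+    curl -s http://0.0.0.0:8000/v1/models | python3 -m json.tool
+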
+
+4. Start the notebook.
+
+From another terminal, follow the same instructions as the previous notebook to launch Jupyter Lab, and then navigate to `this notebook <./llama3-lora-deploy-nim.ipynb>`__.
+
+You can use the same NeMo Framework Docker container, which already has Jupyter Lab installed.
\ No newline at end of file
diff --git a/tutorials/llm/llama-3/img/e2e-lora-train-and-deploy.png b/tutorials/llm/llama-3/biomedical-qa/img/e2e-lora-train-and-deploy.png
similarity index 100%
rename from tutorials/llm/llama-3/img/e2e-lora-train-and-deploy.png
rename to tutorials/llm/llama-3/biomedical-qa/img/e2e-lora-train-and-deploy.png
diff --git a/tutorials/llm/llama-3/llama3-lora-deploy-nim.ipynb b/tutorials/llm/llama-3/biomedical-qa/llama3-lora-deploy-nim.ipynb
similarity index 100%
rename from tutorials/llm/llama-3/llama3-lora-deploy-nim.ipynb
rename to tutorials/llm/llama-3/biomedical-qa/llama3-lora-deploy-nim.ipynb
diff --git a/tutorials/llm/llama-3/llama3-lora-nemofw.ipynb b/tutorials/llm/llama-3/biomedical-qa/llama3-lora-nemofw.ipynb
similarity index 100%
rename from tutorials/llm/llama-3/llama3-lora-nemofw.ipynb
rename to tutorials/llm/llama-3/biomedical-qa/llama3-lora-nemofw.ipynb
diff --git a/tutorials/llm/llama-3/sdg-law-title-generation/README.rst b/tutorials/llm/llama-3/sdg-law-title-generation/README.rst
new file mode 100755
index 000000000000..58fc4a86eaa7
--- /dev/null
+++ b/tutorials/llm/llama-3/sdg-law-title-generation/README.rst
@@ -0,0 +1,170 @@
+Llama 3.1 Law-Domain LoRA Fine-Tuning and Deployment with NeMo Framework and NVIDIA NIM
+============================================================================================
+
+`Llama 3.1 `_ is a family of open-source large language models by Meta that deliver state-of-the-art performance on popular industry benchmarks. They have been pretrained on over 15 trillion tokens, and support a 128K token context length. They are available in three sizes, 8B, 70B, and 405B, and each size has two variants—base pretrained and instruction tuned.
+
+`Low-Rank Adaptation (LoRA) `__ has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.
+
+`NVIDIA NeMo
+Framework `__ provides tools to perform LoRA on Llama 3.1 to fit your use case, which can then be deployed using `NVIDIA NIM `__ for optimized inference on NVIDIA GPUs.
+
+.. figure:: ./img/e2e-lora-train-and-deploy.png
+ :width: 1000
+   :alt: Diagram showing the steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM. The steps include converting the base model to .nemo format, creating LoRA adapters with NeMo, and then deploying the LoRA adapter with NIM for inference.
+ :align: center
+
+ Figure 1: Steps for LoRA customization using the NVIDIA NeMo Framework and deployment with NVIDIA NIM
+
+
+| NIM also enables seamless deployment of multiple LoRA adapters (referred to as “multi-LoRA”) on the same base model. It dynamically loads the adapter weights based on incoming requests at runtime. This flexibility allows handling inputs from various tasks or use cases without deploying a unique model for each individual scenario. For further details, consult the `NIM documentation for LLMs `__.
+
+Objectives
+----------
+
+This tutorial shows how to perform LoRA PEFT on **Llama 3.1 8B Instruct** using a synthetically augmented version of `Law StackExchange `__ with NeMo Framework. Law StackExchange is a dataset of legal questions, question titles, and answers. For this demonstration, we will tune the model on the task of title/subject generation: given a Law StackExchange forum question, auto-generate an appropriate title for it. We will then deploy the LoRA-tuned model with NVIDIA NIM for inference.
+
+Requirements
+-------------
+
+* **Obtain the dataset:** This tutorial is a continuation of the data curation tutorial `Curating Datasets for Parameter Efficient Fine-tuning with Synthetic Data Generation `__, which demonstrates various filtering and processing operations on the records to improve data quality, as well as (optional) synthetic data generation (SDG) to augment the dataset. Follow that tutorial first to obtain the dataset required here.
+
+
+* System Configuration
+ * Access to at least 1 NVIDIA GPU with a cumulative memory of at least 80GB, for example: 1 x H100-80GB or 1 x A100-80GB.
+ * A Docker-enabled environment, with `NVIDIA Container Runtime `_ installed, which will make the container GPU-aware.
+ * `Additional NIM requirements `_.
+
+* `Authenticate with NVIDIA NGC `_, and download `NGC CLI Tool `_. You will use this tool to download the model and customize it with NeMo Framework.
+
+* Get your Hugging Face `access token `_, which will be used to obtain the tokenizer required during training.
+
+
+`Process the Dataset with NeMo Curator `__
+-------------------------------------------------------------------------------------------------------
+
+1. Save the dataset in the current directory. You should have the ``law-qa-{train/val/test}.jsonl`` splits obtained by following the `data curation tutorial `__ mentioned above.
+
+.. code:: bash
+
+ mkdir -p curated-data
+
+ # Make sure to update the path below as appropriate
+ cp .jsonl curated-data/.
+
+
+`Create a LoRA Adapter with NeMo Framework <./llama3-sdg-lora-nemofw.ipynb>`__
+------------------------------------------------------------------------------
+
+For LoRA-tuning the model, you will use the NeMo Framework, which is available as a `Docker container `__.
+
+
+1. Download the `Llama 3.1 8B Instruct .nemo `__ from NVIDIA NGC using the NGC CLI. The following command saves the ``.nemo`` format model in a folder named ``llama-3_1-8b-instruct-nemo_v1.0`` in the current directory. You can specify another path using the ``-d`` option in the CLI tool.
+
+.. code:: bash
+
+ ngc registry model download-version "nvidia/nemo/llama-3_1-8b-instruct-nemo:1.0"
+
+
+
+2. Run the container using the following command. It is assumed that you have the dataset, notebook(s), and the ``llama-3.1-8b-instruct`` model available in the current directory. If not, mount the appropriate folder to ``/workspace``.
+
+.. code:: bash
+
+ export FW_VERSION=24.05.llama3.1
+
+
+.. code:: bash
+
+ docker run \
+ --gpus all \
+ --shm-size=2g \
+ --net=host \
+ --ulimit memlock=-1 \
+ --rm -it \
+ -v ${PWD}:/workspace \
+ -w /workspace \
+ -v ${PWD}/results:/results \
+ nvcr.io/nvidia/nemo:$FW_VERSION bash
+
+3. From within the container, start Jupyter Lab:
+
+.. code:: bash
+
+ jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
+
+4. Then, navigate to `this notebook <./llama3-sdg-lora-nemofw.ipynb>`__.
+
+
+`Deploy the LoRA Inference Adapter with NVIDIA NIM <./llama3-sdg-lora-deploy-nim.ipynb>`__
+------------------------------------------------------------------------------------------------
+
+This procedure demonstrates how to deploy the trained LoRA adapter with NVIDIA NIM. NIM supports LoRA adapters in the ``.nemo`` (from NeMo Framework) and Hugging Face model formats. You will deploy the Law StackExchange title-generation LoRA adapter from the first notebook.
+
+1. Prepare the LoRA model store.
+
+After training is complete, the LoRA model checkpoint will be created at ``./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo``, assuming default paths in the first notebook weren’t modified.
+
+To ensure the model store is organized as expected, create a folder named ``llama3.1-8b-law-titlegen`` under a model store directory, and move your ``.nemo`` checkpoint there.
+
+.. code:: bash
+
+ # Set path to your LoRA model store
+ export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
+
+ mkdir -p $LOCAL_PEFT_DIRECTORY/llama3.1-8b-law-titlegen
+
+ # Ensure the source path is correct
+ cp ./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo $LOCAL_PEFT_DIRECTORY/llama3.1-8b-law-titlegen
+
+
+Ensure that the LoRA model store directory follows this structure: the model name is the name of the sub-folder that contains the ``.nemo`` file.
+
+::
+
+ <$LOCAL_PEFT_DIRECTORY>
+ └── llama3.1-8b-law-titlegen
+ └── megatron_gpt_peft_lora_tuning.nemo
+
+
+Note that NIM supports deployment of multiple LoRA adapters over the same base model. If you have trained or downloaded adapters for other tasks, you can place them in separate sub-folders under ``$LOCAL_PEFT_DIRECTORY``.
+
+2. Set up NIM.
+
+From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:
+
+.. code:: bash
+
+ # Set these configurations
+ export NGC_API_KEY=
+ export NIM_PEFT_REFRESH_INTERVAL=3600 # (in seconds) will check NIM_PEFT_SOURCE for newly added models in this interval
+ export NIM_CACHE_PATH= # Model artifacts (in container) are cached in this directory
+
+
+.. code:: bash
+
+ mkdir -p $NIM_CACHE_PATH
+ chmod -R 777 $NIM_CACHE_PATH
+
+ export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
+ export CONTAINER_NAME=meta-llama3.1-8b-instruct
+
+ docker run -it --rm --name=$CONTAINER_NAME \
+ --gpus all \
+ --network=host \
+ --shm-size=16GB \
+ -e NGC_API_KEY \
+ -e NIM_PEFT_SOURCE \
+ -v $NIM_CACHE_PATH:/opt/nim/.cache \
+ -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
+ nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0
+
+The first time you run the command, it will download the model and cache it in ``$NIM_CACHE_PATH`` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the `NIM configuration `__ documentation.
+
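+Before moving on, you can confirm from the host that the server is up and that the title-generation adapter was picked up. This is a small sketch that assumes the container is serving on the default port 8000 and exposes the standard ``/v1/health/ready`` and ``/v1/models`` endpoints:
+
+.. code:: bash
+
+    # Wait until the server reports ready, then confirm the LoRA adapter is listed
+    curl -s http://0.0.0.0:8000/v1/health/ready
+    curl -s http://0.0.0.0:8000/v1/models | grep llama3.1-8b-law-titlegen
+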
+
+3. Start the notebook.
+
+From another terminal, follow the same instructions as the previous notebook to launch Jupyter Lab, and then navigate to `this notebook <./llama3-sdg-lora-deploy-nim.ipynb>`__.
+
+You can use the same NeMo Framework Docker container, which already has Jupyter Lab installed.
+
+
diff --git a/tutorials/llm/llama-3/sdg-law-title-generation/img/e2e-lora-train-and-deploy.png b/tutorials/llm/llama-3/sdg-law-title-generation/img/e2e-lora-train-and-deploy.png
new file mode 100755
index 000000000000..16bb47eed431
Binary files /dev/null and b/tutorials/llm/llama-3/sdg-law-title-generation/img/e2e-lora-train-and-deploy.png differ
diff --git a/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-deploy-nim.ipynb b/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-deploy-nim.ipynb
new file mode 100755
index 000000000000..783c5d951944
--- /dev/null
+++ b/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-deploy-nim.ipynb
@@ -0,0 +1,170 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "884f3125",
+ "metadata": {},
+ "source": [
+ "# LoRA inference with NVIDIA NIM\n",
+ "\n",
+    "This is a demonstration of running inference against a LoRA adapter deployed with NVIDIA NIM. NIM supports LoRA adapters in the .nemo (from NeMo Framework) and Hugging Face model formats."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b161f16b",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "This notebook includes instructions to send an inference call to NVIDIA NIM using the Python `requests` library."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "54759732",
+ "metadata": {},
+ "source": [
+ "## Before you begin\n",
+    "Ensure that you satisfy the prerequisites and have completed the setup instructions provided in the README associated with this tutorial to deploy the NIM container with LoRA."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e7917da",
+ "metadata": {},
+ "source": [
+ "---"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa2477e9",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "39d9918a",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Check available LoRA models\n",
+ "\n",
+ "Once the NIM server is up and running, check the available models as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2d71965",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "url = 'http://0.0.0.0:8000/v1/models'\n",
+ "\n",
+ "response = requests.get(url)\n",
+ "data = response.json()\n",
+ "\n",
+ "print(json.dumps(data, indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5d5a93ac",
+ "metadata": {},
+ "source": [
+ "This will return all the models available for inference by NIM. In this case, it will return the base model, as well as the LoRA adapters that were provided during NIM deployment - `llama3.1-8b-law-titlegen`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e7c19acd",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## LoRA inference\n",
+ "\n",
+ "Inference can be performed by sending POST requests to the `/completions` endpoint.\n",
+ "\n",
+ "A few things to note:\n",
+ "* The `model` parameter in the payload specifies the model that the request will be directed to. This can be the base model `meta/llama3.1-8b-instruct`, or any of the LoRA models, such as `llama3.1-8b-law-titlegen`.\n",
+    "* The `max_tokens` parameter specifies the maximum number of tokens to generate. At any point, the number of input prompt tokens plus the requested number of output tokens must not exceed the model's maximum context length (Llama 3.1 8B Instruct supports up to 128K tokens).\n",
+ "\n",
+    "The following code snippet shows how to send a request to a specific LoRA (or task). NIM dynamically loads the LoRA adapters and serves the requests. It also internally batches requests belonging to different LoRAs for better performance and more efficient use of compute."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3edd2a9e",
+ "metadata": {},
+ "source": [
+ "### Title Generation\n",
+ "\n",
+ "Try sending an example from the test set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cf6ea42a",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "url = 'http://0.0.0.0:8000/v1/completions'\n",
+ "\n",
+ "headers = {\n",
+ " 'accept': 'application/json',\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "# Example from the test set, following the template we trained the lora with\n",
+ "prompt=\"Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \\nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \\nTITLE: \"\n",
+ "data = {\n",
+ " \"model\": \"llama3.1-8b-law-titlegen\",\n",
+ " \"prompt\": prompt,\n",
+ " \"max_tokens\": 25,\n",
+ " \"temperature\":0\n",
+ "}\n",
+ "\n",
+ "response = requests.post(url, headers=headers, json=data)\n",
+ "response_data = response.json()\n",
+ "\n",
+ "print(json.dumps(response_data, indent=4))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb b/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb
new file mode 100755
index 000000000000..d597a60d6c0b
--- /dev/null
+++ b/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb
@@ -0,0 +1,562 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "a952a54c",
+ "metadata": {},
+ "source": [
+ "# Creating a Llama 3.1 LoRA adapter with NeMo Framework using a Synthetic Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "63b3bd1c",
+ "metadata": {},
+ "source": [
+    "This notebook showcases LoRA fine-tuning of **Llama 3.1-8B-Instruct** on a synthetically augmented version of the [Law StackExchange](https://huggingface.co/datasets/ymoslem/Law-StackExchange) dataset using NeMo Framework. Law StackExchange is a dataset of legal questions and answers. Each record consists of a question, its title, and human-provided answers.\n",
+ "\n",
+ "For this demonstration, we will tune the model on the task of title/subject generation, that is, given a Law StackExchange forum question, auto-generate an appropriate title for it.\n",
+ "\n",
+ "> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. **Instructions are available in the associated tutorial README to download the model and the container.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "92ba5569",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!pip install ipywidgets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1cf3dc30",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import json\n",
+ "import numpy as np\n",
+ "from rouge_score import rouge_scorer, scoring"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0b129833",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "---\n",
+ "## Before you begin\n",
+ "Ensure you have the following -\n",
+ "1. **Generate the synthetic dataset**: Follow the [PEFT Synthetic Data Generation (SDG)](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg) tutorial to obtain the synthetic dataset. Once obtained, you must follow the instructions in the associated README to mount it in the NeMo FW container."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8492edc1",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "DATA_DIR = os.path.join(\"/workspace/curated-data\")\n",
+ "\n",
+ "!ls {DATA_DIR}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97990ad8",
+ "metadata": {},
+ "source": [
+    "You should see the `law-qa-{train/val/test}.jsonl` splits obtained by following the SDG tutorial mentioned above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "63b061f4",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "TRAIN_DS = os.path.join(DATA_DIR, \"law-qa-train.jsonl\")\n",
+ "VAL_DS = os.path.join(DATA_DIR, \"law-qa-val.jsonl\")\n",
+ "TEST_DS = os.path.join(DATA_DIR, \"law-qa-test.jsonl\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1c0f9b8",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "2. **Get the model**: Download the `Meta Llama 3.1 8B Instruct .nemo` model and mount the corresponding folder to the container."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3728f222",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "!ls /workspace/llama-3_1-8b-instruct-nemo_v1.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "56b7a698",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "3. **Set the Hugging Face Access Token**: You can obtain this from your [Hugging Face account](https://huggingface.co/docs/hub/en/security-tokens). "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b546cb59",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from huggingface_hub import login\n",
+ "\n",
+ "login(token=\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "570025c5",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "---\n",
+ "## Step-by-step instructions\n",
+ "\n",
+ "This notebook is structured into four steps:\n",
+ "1. Prepare the dataset\n",
+ "2. Run the PEFT finetuning script\n",
+ "3. Inference with NeMo Framework\n",
+ "4. Check the model accuracy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3894607a",
+ "metadata": {},
+ "source": [
+ "### Step 1: Prepare the dataset\n",
+ "\n",
+    "This dataset has already undergone several filtering and processing operations, and it can be used to train the model for various tasks: question title generation (summarization), law-domain question answering, and question tag generation (multi-label classification).\n",
+ "\n",
+ "Take a look at a single row in the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c6b47e31",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# TRAIN, VAL and TEST splits all follow the same structure\n",
+ "!head -n1 {TRAIN_DS}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9bed493d",
+ "metadata": {},
+ "source": [
+ "You will see several fields in the `.jsonl`, including `title`, `question`, `answer`, and other associated metadata.\n",
+ "\n",
+    "For this tutorial, our input will be the `question` field, and the output will be its `title`. \n",
+ "\n",
+    "The following cell does two things:\n",
+    "* Adds a template: an (optional) prompt instruction followed by the format `{PROMPT} \\nQUESTION: {data[\"question\"]} \\nTITLE: `.\n",
+    "* Saves the data splits to the same location, appending a `_preprocessed` suffix to the filenames."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "188b93b7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Add a prompt instruction.\n",
+ "PROMPT='''Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more.'''\n",
+ "\n",
+ "# Creates a preprocessed version of the data files\n",
+ "for input_file in [TRAIN_DS, VAL_DS, TEST_DS]:\n",
+ " output_file = input_file.rsplit('.', 1)[0] + '_preprocessed.jsonl'\n",
+ " with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:\n",
+ " for line in infile:\n",
+ " # Parse each line as JSON\n",
+ " data = json.loads(line)\n",
+ "\n",
+ " # Create a new dictionary with only the desired fields, renamed and formatted\n",
+ " new_data = {\n",
+ " \"input\": f'''{PROMPT} \\nQUESTION: {data[\"question\"]} \\nTITLE: ''',\n",
+ " \"output\": data['title']\n",
+ " }\n",
+ "\n",
+ " # Write the new data as a JSON line to the output file\n",
+ " json.dump(new_data, outfile)\n",
+ " outfile.write('\\n') # Add a newline after each JSON object\n",
+ "\n",
+ " print(f\"Processed {input_file} and created {output_file}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "39388cc3",
+ "metadata": {},
+ "source": [
+    "After running the cell above, you will see `law-qa-{train/test/val}_preprocessed.jsonl` files appear in the data directory.\n",
+    "\n",
+    "This is what a formatted example looks like:\n",
+ "\n",
+ "```json\n",
+ "{\"input\": \"Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \\nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \\nTITLE: \", \n",
+ " \"output\": \"What constitutes \\\"doing business in a jurisdiction?\\\"\"}\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f53038ad",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+    "# Clear up any cached memory-map index files from previous runs\n",
+    "!rm -f curated-data/*idx*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd28f0d",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Step 2: Run PEFT finetuning script for LoRA\n",
+ "\n",
+    "NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), which abstracts away some of the lower-level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo is essentially just running this script!\n",
+ "\n",
+ "For this demonstration, this training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.\n",
+ "\n",
+ "> `NOTE:` In the block of code below, pass the paths to your train, test and validation data files as well as path to the .nemo model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "15228de7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "%%bash\n",
+ "\n",
+ "# Set paths to the model, train, validation and test sets.\n",
+ "MODEL=\"/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo\"\n",
+ "\n",
+ "TRAIN_DS=\"[./curated-data/law-qa-train_preprocessed.jsonl]\"\n",
+ "VALID_DS=\"[./curated-data/law-qa-val_preprocessed.jsonl]\"\n",
+ "TEST_DS=\"[./curated-data/law-qa-test_preprocessed.jsonl]\"\n",
+ "TEST_NAMES=\"[law]\"\n",
+ "\n",
+ "SCHEME=\"lora\"\n",
+ "TP_SIZE=1\n",
+ "PP_SIZE=1\n",
+ "\n",
+ "OUTPUT_DIR=\"./results/Meta-llama3.1-8B-Instruct-titlegen\"\n",
+    "rm -rf $OUTPUT_DIR\n",
+ "\n",
+ "torchrun --nproc_per_node=1 \\\n",
+ "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \\\n",
+ " exp_manager.exp_dir=${OUTPUT_DIR} \\\n",
+ " exp_manager.explicit_log_dir=${OUTPUT_DIR} \\\n",
+ " trainer.devices=1 \\\n",
+ " trainer.num_nodes=1 \\\n",
+ " trainer.precision=bf16-mixed \\\n",
+ " trainer.val_check_interval=0.2 \\\n",
+ " trainer.max_steps=1000 \\\n",
+ " model.megatron_amp_O2=True \\\n",
+ " ++model.mcore_gpt=True \\\n",
+ " model.tensor_model_parallel_size=${TP_SIZE} \\\n",
+ " model.pipeline_model_parallel_size=${PP_SIZE} \\\n",
+ " model.micro_batch_size=1 \\\n",
+ " model.global_batch_size=32 \\\n",
+ " model.restore_from_path=${MODEL} \\\n",
+ " model.data.train_ds.file_names=${TRAIN_DS} \\\n",
+ " model.data.train_ds.concat_sampling_probabilities=[1.0] \\\n",
+ " model.data.validation_ds.file_names=${VALID_DS} \\\n",
+ " model.peft.peft_scheme=${SCHEME}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "268e4618",
+ "metadata": {},
+ "source": [
+ "This will create a LoRA adapter - a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/`. We'll use this later.\n",
+ "\n",
+ "To further configure the run above -\n",
+ "\n",
+ "* **A different PEFT technique**: The `peft.peft_scheme` parameter determines the technique being used. In this case, we did LoRA, but NeMo Framework supports other techniques as well - such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, for P-tuning, simply set \n",
+ "\n",
+ "```bash\n",
+ "model.peft.peft_scheme=\"ptuning\" # instead of \"lora\"\n",
+ "```\n",
+ "You can override many such configurations (such as `learning rate`, `adapter dim`, and more) while running the script. A full set of possible configurations is available in [NeMo Framework Github](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml)."
+ ]
+ },
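+  {
+   "cell_type": "markdown",
+   "id": "lora-config-overrides-example",
+   "metadata": {},
+   "source": [
+    "For example, as a sketch (the values below are illustrative, not tuned recommendations), the LoRA rank and learning rate can be changed by appending overrides to the `torchrun` command in Step 2. The key names come from the configuration file linked above; verify them against your NeMo version before use.\n",
+    "\n",
+    "```bash\n",
+    "# Hypothetical example: append these overrides to the torchrun command in Step 2\n",
+    "model.peft.lora_tuning.adapter_dim=16 \\\n",
+    "model.optim.lr=1e-4\n",
+    "```"
+   ]
+  },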
+ {
+ "cell_type": "markdown",
+ "id": "8fed5465",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Step 3: Inference with NeMo Framework\n",
+ "\n",
+    "Running text generation within the framework is also possible by running a Python script. Note that this is intended for testing and validation, not as a full-fledged deployment solution like NVIDIA NIM."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a5adfc",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Check that the LORA model file exists\n",
+ "!ls -l ./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50b1dc9b",
+ "metadata": {},
+ "source": [
+    "In the code snippet below, the following configurations are worth noting:\n",
+    "\n",
+    "1. `model.restore_from_path` points to the `llama3_1_8b_instruct.nemo` file downloaded earlier.\n",
+    "2. `model.peft.restore_from_path` points to the PEFT checkpoint that was created in the fine-tuning run in the previous step.\n",
+    "3. `model.data.test_ds.file_names` points to the path of the preprocessed test file."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0bd9d602",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Create a smaller test subset for a quick eval demonstration.\n",
+ "\n",
+ "!head -n 128 ./curated-data/law-qa-test_preprocessed.jsonl > ./curated-data/law-qa-test_preprocessed-n128.jsonl"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d029b0b5",
+ "metadata": {},
+ "source": [
+ "If you have made any changes in model or experiment paths, please ensure they are configured correctly below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "630c0305",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "%%bash\n",
+ "MODEL=\"/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo\"\n",
+ "\n",
+ "TEST_DS=\"[./curated-data/law-qa-test_preprocessed-n128.jsonl]\" # Smaller test split\n",
+ "# TEST_DS=\"[./curated-data/law-qa-test_preprocessed.jsonl]\" # Full test set\n",
+ "TEST_NAMES=\"[law]\"\n",
+ "\n",
+ "TP_SIZE=1\n",
+ "PP_SIZE=1\n",
+ "\n",
+ "# This is where your LoRA checkpoint was saved\n",
+ "PATH_TO_TRAINED_MODEL=\"/workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo\"\n",
+ "\n",
+ "# The generation run will save the generated outputs over the test dataset in a file prefixed like so\n",
+ "OUTPUT_PREFIX=\"law_titlegen_lora\"\n",
+ "\n",
+ "python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \\\n",
+ " model.restore_from_path=${MODEL} \\\n",
+ " model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \\\n",
+ " trainer.devices=1 \\\n",
+ " trainer.num_nodes=1 \\\n",
+ " model.data.test_ds.file_names=${TEST_DS} \\\n",
+ " model.data.test_ds.names=${TEST_NAMES} \\\n",
+ " model.data.test_ds.global_batch_size=32 \\\n",
+ " model.data.test_ds.micro_batch_size=1 \\\n",
+ " model.data.test_ds.tokens_to_generate=25 \\\n",
+ " model.tensor_model_parallel_size=${TP_SIZE} \\\n",
+ " model.pipeline_model_parallel_size=${PP_SIZE} \\\n",
+ " inference.greedy=True \\\n",
+ " model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \\\n",
+ " model.data.test_ds.write_predictions_to_file=True \\\n",
+ " model.data.test_ds.truncation_field=\"null\" \\\n",
+ " model.data.test_ds.add_bos=False \\\n",
+ " model.data.test_ds.add_eos=True \\\n",
+ " model.data.test_ds.add_sep=False \\\n",
+ " model.data.test_ds.label_key=\"output\" \\\n",
+ " model.data.test_ds.prompt_template=\"\\{input\\}\\ \\{output\\}\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "513cd732",
+ "metadata": {},
+ "source": [
+ "### Step 4: Check the model accuracy\n",
+ "\n",
+    "Now that the results are in, let's read them and calculate the accuracy on the question title generation task.\n",
+    "Take a look at one of the predictions in the generated output file. The `pred` key contains the generated text."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "04cb4ae7",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# Take a look at predictions\n",
+ "!head -n1 law_titlegen_lora_test_law_inputs_preds_labels.jsonl"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4e88f3c9",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+    "For evaluating this task, we will use [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)). It measures the overlap of n-grams, and a higher score is better. While it is not perfect and does not fully capture the semantics of a prediction, it is a popular metric in academia and industry for evaluating such systems. \n",
+ "\n",
+ "The following method uses the `rouge_score` library to implement scoring. It will report `ROUGE_{1/2/L/Lsum}` metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c4aa9631",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def compute_rouge(input_file: str) -> dict:\n",
+ " ROUGE_KEYS = [\"rouge1\", \"rouge2\", \"rougeL\", \"rougeLsum\"]\n",
+ " scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)\n",
+ " aggregator = scoring.BootstrapAggregator()\n",
+ " lines = [json.loads(line) for line in open(input_file)]\n",
+ " num_response_words = []\n",
+ " num_ref_words = []\n",
+ " for idx, line in enumerate(lines):\n",
+ " prompt = line['input']\n",
+ " response = line['pred']\n",
+ " answer = line['label']\n",
+ " scores = scorer.score(response, answer)\n",
+ " aggregator.add_scores(scores)\n",
+ " num_response_words.append(len(response.split()))\n",
+ " num_ref_words.append(len(answer.split()))\n",
+ "\n",
+ " result = aggregator.aggregate()\n",
+ " rouge_scores = {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}\n",
+ " print(rouge_scores)\n",
+ " print(f\"Average and stddev of response length: {np.mean(num_response_words):.2f}, {np.std(num_response_words):.2f}\")\n",
+ " print(f\"Average and stddev of ref length: {np.mean(num_ref_words):.2f}, {np.std(num_ref_words):.2f}\")\n",
+ "\n",
+ " return rouge_scores"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "41c661d5",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "compute_rouge(\"./law_titlegen_lora_test_law_inputs_preds_labels.jsonl\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f48667b1",
+ "metadata": {},
+ "source": [
+    "For the Llama-3.1-8B-Instruct model, you should see ROUGE scores comparable to the following:\n",
+ "```\n",
+ "{'rouge1': 39.2082, 'rouge2': 18.8573, 'rougeL': 35.4098, 'rougeLsum': 35.3906}\n",
+ "```"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}