PacificAI · ArshaanNazir · Dec 1, 2023 · Nov 16, 2023
diff --git a/README.md b/README.md
@@ -101,7 +101,16 @@ You can check out the following LangTest articles:
 | [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https://medium.com/john-snow-labs/evaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | In this blog post, we dive into testing the WinoBias dataset on LLMs, examining language models’ handling of gender and occupational roles, evaluation metrics, and the wider implications. Let’s explore the evaluation of language models with LangTest on the WinoBias dataset and confront the challenges of addressing bias in AI. |
 | [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https://medium.com/john-snow-labs/streamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models. Enter MLFlow and LangTest: two tools that, when combined, create a revolutionary approach to ML development. |
 | [**Testing the Question Answering Capabilities of Large Language Models**](https://medium.com/john-snow-labs/testing-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | In this blog post, we dive into enhancing the QA evaluation capabilities using LangTest library. Explore about different evaluation methods that LangTest offers to address the complexities of evaluating Question Answering (QA) tasks. |
-| [**Evaluating Stereotype Bias with LangTest**](To be published soon) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
+| [**Evaluating Stereotype Bias with LangTest**](https://medium.com/john-snow-labs/evaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
+| [**Unveiling Sentiments: Exploring LSTM-based Sentiment Analysis with PyTorch on the IMDB Dataset**](To be Published) | Explore the robustness of custom models with LangTest Insights.|
+| [**LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA**](To be Published) | Explore the robustness of Large Language Models (LLMs) on the OpenBookQA dataset with LangTest Insights.|
+| [**LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models**](To be Published) | Explore the robustness of Transformers Language Models with LangTest Insights.|
+
+
+
+
+
+
 
 
 > **Note**

diff --git a/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb b/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb
diff --git a/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb b/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb
diff --git a/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb b/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb
diff --git a/demo/tutorials/misc/HF_Callback_NER.ipynb b/demo/tutorials/misc/HF_Callback_NER.ipynb
diff --git a/demo/tutorials/misc/HF_Callback_Text_Classification.ipynb b/demo/tutorials/misc/HF_Callback_Text_Classification.ipynb
diff --git a/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb b/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb
diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
@@ -56,6 +56,8 @@ docs-menu:
           url: /docs/pages/docs/report
         - title: MlFlow Tracking
           url: /docs/pages/docs/ml_flow
+        - title: LangTestCallback
+          url: /docs/pages/docs/hf-callback
 
   - title: Saving & Loading
     url: /docs/pages/docs/save
@@ -135,10 +137,6 @@ tutorials:
           url: /docs/pages/tutorials/LLM_testing_Notebooks/sycophancy
         - title: Stereotype
           url: /docs/pages/tutorials/LLM_testing_Notebooks/stereotype
-  - title: Benchmark Dataset Notebooks
-    url: /docs/pages/tutorials/Benchmark_Dataset_Notebook_Notebooks
-  - title: End-to-End Workflow Notebooks
-    url: /docs/pages/tutorials/End_to_End_workflow_Notebooks
   - title: Miscellaneous Notebooks
     url: /docs/pages/tutorials/Miscellaneous_Notebooks
     children:
@@ -150,6 +148,10 @@ tutorials:
           url: /docs/pages/tutorials/misc/different_report_formats
         - title: Editing Testcases
           url: /docs/pages/tutorials/misc/editing-testcases
+  - title: Benchmark Dataset Notebooks
+    url: /docs/pages/tutorials/Benchmark_Dataset_Notebook_Notebooks
+  - title: End-to-End Workflow Notebooks
+    url: /docs/pages/tutorials/End_to_End_workflow_Notebooks
 
 tests:
   - title: Tests
@@ -190,50 +192,63 @@ tests:
     url: /docs/pages/tests/ideology
 
 benchmarks:
-  - title: Benchmarks
-    url: /docs/pages/benchmarks/benchmark
+  - title: Medical
+    url: /docs/pages/benchmarks/medical
+    children:
+      - title: MedMCQA
+        url: /docs/pages/benchmarks/medical/medmcqa
+      - title: MedQA
+        url: /docs/pages/benchmarks/medical/medqa
+      - title: PubMedQA
+        url: /docs/pages/benchmarks/medical/pubmedqa
+  - title: Commonsense Scenario
+    url: /docs/pages/benchmarks/commonsense_scenario
     children:
-      - title: ASDiv
-        url: /docs/pages/benchmarks/asdiv
-      - title: BBQ
-        url: /docs/pages/benchmarks/bbq
-      - title: Bigbench
-        url: /docs/pages/benchmarks/bigbench
-      - title: BoolQ
-        url: /docs/pages/benchmarks/boolq
       - title: CommonsenseQA
-        url: /docs/pages/benchmarks/commonsenseqa
-      - title: FIQA
-        url: /docs/pages/benchmarks/fiqa
+        url: /docs/pages/benchmarks/commonsense_scenario/commonsenseqa
       - title: HellaSwag
-        url: /docs/pages/benchmarks/hellaswag
-      - title: Consumer-Contracts
-        url: /docs/pages/benchmarks/consumer-contracts
+        url: /docs/pages/benchmarks/commonsense_scenario/hellaswag
+      - title: OpenBookQA
+        url: /docs/pages/benchmarks/commonsense_scenario/openbookqa
+      - title: PIQA
+        url: /docs/pages/benchmarks/commonsense_scenario/piqa
+      - title: SIQA
+        url: /docs/pages/benchmarks/commonsense_scenario/siqa
+  - title: Legal
+    url: /docs/pages/benchmarks/legal
+    children:
       - title: Contracts
-        url: /docs/pages/benchmarks/contracts
+        url: /docs/pages/benchmarks/legal/contracts
+      - title: Consumer-Contracts
+        url: /docs/pages/benchmarks/legal/consumer-contracts
       - title: Privacy-Policy
-        url: /docs/pages/benchmarks/privacy-policy
+        url: /docs/pages/benchmarks/legal/privacy-policy
+      - title: FIQA
+        url: /docs/pages/benchmarks/legal/fiqa
+      - title: MultiLexSum
+        url: /docs/pages/benchmarks/legal/multilexsum
+  - title: Other Benchmarks
+    url: /docs/pages/benchmarks/other_benchmarks
+    children:
+      - title: ASDiv
+        url: /docs/pages/benchmarks/other_benchmarks/asdiv
+      - title: BBQ
+        url: /docs/pages/benchmarks/other_benchmarks/bbq
+      - title: Bigbench
+        url: /docs/pages/benchmarks/other_benchmarks/bigbench
+      - title: BoolQ
+        url: /docs/pages/benchmarks/other_benchmarks/boolq
       - title: LogiQA
-        url: /docs/pages/benchmarks/logiqa
+        url: /docs/pages/benchmarks/other_benchmarks/logiqa
       - title: MMLU
-        url: /docs/pages/benchmarks/mmlu
-      - title: MultiLexSum
-        url: /docs/pages/benchmarks/multilexsum
+        url: /docs/pages/benchmarks/other_benchmarks/mmlu
       - title: NarrativeQA
-        url: /docs/pages/benchmarks/narrativeqa
+        url: /docs/pages/benchmarks/other_benchmarks/narrativeqa
       - title: NQ-open
-        url: /docs/pages/benchmarks/nq-open
-      - title: OpenBookQA
-        url: /docs/pages/benchmarks/openbookqa
-      - title: PIQA
-        url: /docs/pages/benchmarks/piqa
+        url: /docs/pages/benchmarks/other_benchmarks/nq-open
       - title: Quac
-        url: /docs/pages/benchmarks/quac
-      - title: SIQA
-        url: /docs/pages/benchmarks/siqa
+        url: /docs/pages/benchmarks/other_benchmarks/quac
       - title: TruthfulQA
-        url: /docs/pages/benchmarks/truthfulqa
+        url: /docs/pages/benchmarks/other_benchmarks/truthfulqa
       - title: XSum
-        url: /docs/pages/benchmarks/xsum
-
-
+        url: /docs/pages/benchmarks/other_benchmarks/xsum
diff --git a/docs/assets/images/benchmarks/medmcq.png b/docs/assets/images/benchmarks/medmcq.png
diff --git a/docs/assets/images/benchmarks/medqa.png b/docs/assets/images/benchmarks/medqa.png
diff --git a/docs/assets/images/benchmarks/mmlu.png b/docs/assets/images/benchmarks/mmlu.png
diff --git a/docs/assets/images/benchmarks/openbookqa.png b/docs/assets/images/benchmarks/openbookqa.png