Merged
97 commits
69aeca2
Merge pull request #983 from JohnSnowLabs/release/2.0.0
ArshaanNazir Feb 20, 2024
4cabaf5
Add out_praser parameter to PretrainedModel constructor
chakravarthik27 Feb 23, 2024
18cdd92
Add "web" to INSTALLED_HUBS and update chat_completion_api function
chakravarthik27 Feb 24, 2024
c505082
Update lmstudio_modelhandler.py with typing and variable name changes
chakravarthik27 Feb 24, 2024
22b54ca
Refactor hub renaming logic in ModelAPI
chakravarthik27 Feb 24, 2024
4743fb6
Add test case for loading a model from a generic API
chakravarthik27 Feb 26, 2024
82517c7
Refactor TestFactory class to handle exceptions in async tests
chakravarthik27 Feb 28, 2024
897b38b
data augmentation support for question-answering task
chakravarthik27 Mar 1, 2024
2f76360
Add blank line to QAFormatter's to_jsonl method
chakravarthik27 Mar 1, 2024
7ec8e1d
Refactor code in format.py to handle context, options, and target fie…
chakravarthik27 Mar 1, 2024
b5622b9
Merge pull request #986 from JohnSnowLabs/feautre/integration-with-we…
chakravarthik27 Mar 3, 2024
113fc64
Merge pull request #990 from JohnSnowLabs/fix/missing-error-handling-…
chakravarthik27 Mar 3, 2024
b5565eb
Merge pull request #991 from JohnSnowLabs/featue/explore-data-augment…
chakravarthik27 Mar 3, 2024
3dbb570
Update dependencies in pyproject.toml
chakravarthik27 Mar 5, 2024
52cb0af
Update langchain version to 0.1.11
chakravarthik27 Mar 5, 2024
8cbbeae
Update pydantic version to 1.10.8
chakravarthik27 Mar 5, 2024
748b171
Update function to handle edge cases
chakravarthik27 Mar 5, 2024
b041005
Update force-cpu-torch installation command
chakravarthik27 Mar 6, 2024
3184fb6
add support for multiple data format using pandas into BaseDataset an…
chakravarthik27 Mar 12, 2024
08a7445
Refactor token validation method signature
chakravarthik27 Mar 12, 2024
9574dba
Refactored file extension mapping in PandasDataset class
chakravarthik27 Mar 12, 2024
1671133
fixed linting issue
chakravarthik27 Mar 12, 2024
ea4966f
Refactor data source handling and add support for additional file ext…
chakravarthik27 Mar 13, 2024
df94c44
Add PandasDataset tests for loading data from different file formats
chakravarthik27 Mar 13, 2024
6882a40
Add extra-lib installation task to build and test workflow
chakravarthik27 Mar 13, 2024
f5efdd2
Refactor accuracy calculation in LLMEval class
chakravarthik27 Mar 17, 2024
aa6295e
Add global dataset config variable
chakravarthik27 Mar 18, 2024
372f5f5
Refactor multi_dataset_report and LLMEval class
chakravarthik27 Mar 18, 2024
43864c8
Refactor code to remove unused imports and fix formatting
chakravarthik27 Mar 18, 2024
4e18608
Remove unnecessary print statement in Harness class
chakravarthik27 Mar 19, 2024
78eb2aa
Clear batches info if it exists in Harness class
chakravarthik27 Mar 19, 2024
6f64c00
Update function to handle test cases method in Harness
chakravarthik27 Mar 19, 2024
51f2c7c
Add support for hdf5 file extension in ext_map
chakravarthik27 Mar 20, 2024
d82f090
Refactor code to improve readability and handle different scenarios i…
chakravarthik27 Mar 21, 2024
89636cb
* BugFix in accuracy run method (eval_model = None) after load method
chakravarthik27 Mar 21, 2024
8672677
Add leaderboard functionality
chakravarthik27 Mar 22, 2024
c9a5c27
fix: model_response function for multi-dataset.
chakravarthik27 Mar 25, 2024
5539e08
Add dataset_name column for multi-dataset
chakravarthik27 Mar 25, 2024
41f93a7
fix: Renamed ModelResponse cls to TestResultManager cls
chakravarthik27 Mar 25, 2024
3dc078d
update: for fairness test cls
chakravarthik27 Mar 25, 2024
082baae
Refactor leaderboard.py and add checkpoints directory
chakravarthik27 Mar 27, 2024
3ad86cb
Add new error messages and update checkpoint deletion logic
chakravarthik27 Mar 27, 2024
986e54e
Remove unnecessary whitespace in CheckpointManager class
chakravarthik27 Mar 27, 2024
73d2504
leaderboard.py: Refactored logic for configuring the harness and savi…
Prikshit7766 Mar 27, 2024
6d8ef40
Add "reports" directory to required_dirs list
Prikshit7766 Mar 27, 2024
f9676f7
Refactor leaderboard output-dir logic
Prikshit7766 Mar 27, 2024
f3f2560
some comments add in langtest.py
chakravarthik27 Mar 27, 2024
8ce959e
updated leaderboard.py and helpers.py
Prikshit7766 Mar 28, 2024
5bfa8f4
fix score logic for accuracy and robustness
Prikshit7766 Mar 28, 2024
0cc9423
leaderboard.py: added method for saving accuracy, robustness summary …
Prikshit7766 Mar 28, 2024
f8863d2
Merge pull request #992 from JohnSnowLabs/Improvement/resolve-the-dep…
chakravarthik27 Mar 28, 2024
dfe873c
Merge remote-tracking branch 'origin/release/2.0.1' into feature/add-…
chakravarthik27 Mar 28, 2024
9576d2c
Update pyproject.toml with new dependencies
chakravarthik27 Mar 28, 2024
b86b560
Merge pull request #998 from JohnSnowLabs/fix/implement-the-multiple-…
chakravarthik27 Mar 29, 2024
e30c523
resolved: overridig the _generated_result
chakravarthik27 Mar 29, 2024
42bc488
Refactor conditional statement in langtest.py
chakravarthik27 Mar 29, 2024
1acc6a5
Merge branch 'fix/implement-the-multiple-dataset-support-for-accuracy…
Prikshit7766 Mar 29, 2024
3a9bec5
added check in BLEU score
Prikshit7766 Mar 29, 2024
b2dce2d
Merge pull request #993 from JohnSnowLabs/feature/add-support-for-oth…
chakravarthik27 Mar 29, 2024
a40fcd6
Merge pull request #1000 from JohnSnowLabs/fix/implement-the-multiple…
chakravarthik27 Mar 29, 2024
6d0a564
added support for muti-dataset in leaderboard and added update_leader…
Prikshit7766 Mar 29, 2024
bd4a277
Merge branch 'release/2.0.1' of https://github.com/JohnSnowLabs/langt…
Prikshit7766 Mar 30, 2024
d1769b8
Add blank line for readability in helpers.py
chakravarthik27 Mar 30, 2024
f65712c
updated leaderboard.py
Prikshit7766 Mar 30, 2024
ba4b7c7
Refactor JSONLDataset to aggregate JSONL files
chakravarthik27 Mar 30, 2024
94e6311
updated datasource
Prikshit7766 Mar 30, 2024
94b8eb8
updated leaderboard.py
Prikshit7766 Mar 30, 2024
666278a
resolved: default datasets paths
chakravarthik27 Mar 31, 2024
da2d5cd
Fix dataset name typo in SecuritySample class
chakravarthik27 Mar 31, 2024
c26b258
Fix condition for checking custom labels in DataFactory class
chakravarthik27 Mar 31, 2024
8b871b5
Fix JSONL file loading in DataFactory and JSONLDataset
chakravarthik27 Mar 31, 2024
1733e07
Fix type hinting and formatting in datasource.py and leaderboard.py
chakravarthik27 Mar 31, 2024
67b8841
Add saving generated results and model responses to CSV files, and pr…
chakravarthik27 Apr 1, 2024
fe2bc19
Add show-leaderboard command to langtest leaderboard.py
chakravarthik27 Apr 1, 2024
439ac90
Added: Updated Multi_Dataset Notebook and another notebooks
chakravarthik27 Apr 1, 2024
46c26de
Add MMLU dataset to JSONLDataset in aggregate_jsonl()
chakravarthik27 Apr 1, 2024
ff9bf2a
Merge pull request #999 from JohnSnowLabs/feature/Implement-Load-&-Sa…
chakravarthik27 Apr 1, 2024
72ee3dc
Update image and URL paths in README.md and fix dataset_name in datas…
chakravarthik27 Apr 1, 2024
ff0be77
Add default value for __is_multi_model = False for multi-model in Har…
chakravarthik27 Apr 1, 2024
b8b4102
fix edit and import testcases for multi-dataset and single-dataset.
chakravarthik27 Apr 1, 2024
6011da8
Add comment in import edited testcases - single dataset case
chakravarthik27 Apr 1, 2024
2c40b93
fix default .langtest folder in UserProfile.
chakravarthik27 Apr 1, 2024
2a90cb6
Fix typo in get_embedding method call
Prikshit7766 Apr 1, 2024
c408e15
Add server_prompt parameter in the run method for summarization task
Prikshit7766 Apr 1, 2024
a6350e9
Merge pull request #1003 from JohnSnowLabs/fix/bug-fixes-langtest-2-1…
chakravarthik27 Apr 2, 2024
7cc6b42
Update get_parameters function to support both .yml and .yaml file ex…
chakravarthik27 Apr 2, 2024
94130aa
Add section on testing API-based models
chakravarthik27 Apr 2, 2024
faf004a
Merge remote-tracking branch 'origin/gh-pages' into chore/final_websi…
chakravarthik27 Apr 2, 2024
f407a3e
Merge remote-tracking branch 'origin/release/2.0.1' into chore/final_…
chakravarthik27 Apr 2, 2024
58c846e
Updated NB and previous release notes
chakravarthik27 Apr 2, 2024
7bc4975
added 2.1.0 release notes
chakravarthik27 Apr 2, 2024
95433f1
Merge pull request #1001 from JohnSnowLabs/chore/final_website_updates
chakravarthik27 Apr 2, 2024
d0ba9fb
Refactor save_file function to include type hinting for file_path par…
chakravarthik27 Apr 2, 2024
5776691
Fix spacing in save_file function signature
chakravarthik27 Apr 2, 2024
1e3d69f
update release 2.0.0 to 2.1.0
chakravarthik27 Apr 2, 2024
4c374a7
Merge pull request #1004 from JohnSnowLabs/fix/bug-fixes-langtest-2-1…
chakravarthik27 Apr 3, 2024
4e6a83b
Merge pull request #1005 from JohnSnowLabs/release/2.0.1
chakravarthik27 Apr 3, 2024
1 change: 1 addition & 0 deletions .github/workflows/build_and_test.yml
@@ -61,4 +61,5 @@ jobs:
     - name: Test with pytest
       run: |
         poetry run task force-cpu-torch
+        poetry run task extra-lib
         poetry run task test
4 changes: 2 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
<p align="center">
-<img src="docs/assets/images/langtest/langtest_logo.png" alt="johnsnowlabs_logo" width="360" style="text-align:center;">
+<img src="https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/docs/assets/images/langtest/langtest_logo.png" alt="johnsnowlabs_logo" width="360" style="text-align:center;">
</p>

<div align="center">
@@ -35,7 +35,7 @@
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg">
</a>

-![Langtest Workflow](docs/assets/images/langtest/langtest_flow_graphic.jpeg)
+![Langtest Workflow](https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/docs/assets/images/langtest/langtest_flow_graphic.jpeg)

<p align="center">
<a href="https://langtest.org/">Project's Website</a> •
560 changes: 560 additions & 0 deletions demo/tutorials/benchmarks/Langtest_Cli_Eval_Command.ipynb

Large diffs are not rendered by default.

2,745 changes: 2,745 additions & 0 deletions demo/tutorials/misc/Generic_API-Based_Model_Testing_Demo.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion demo/tutorials/misc/Multiple_dataset.ipynb

Large diffs are not rendered by default.

322 changes: 236 additions & 86 deletions docs/pages/docs/langtest_versions/latest_release.md

Large diffs are not rendered by default.

145 changes: 145 additions & 0 deletions docs/pages/docs/langtest_versions/release_notes_1_10_0.md
@@ -0,0 +1,145 @@
---
layout: docs
header: true
seotitle: LangTest - Deliver Safe and Effective Language Models | John Snow Labs
title: LangTest Release Notes
permalink: /docs/pages/docs/langtest_versions/release_notes_1_10_0
key: docs-release-notes
modify_date: 2023-10-17
---

<div class="h3-box" markdown="1">

## 1.10.0

## 📢 Highlights


🌟 **LangTest 1.10.0 Release by John Snow Labs**

We're thrilled to announce the latest release of LangTest, introducing remarkable features that elevate its capabilities and user-friendliness. This update brings a host of enhancements:

- **Evaluating RAG with LlamaIndex and LangTest:** LangTest seamlessly integrates LlamaIndex for constructing a RAG and employs LangtestRetrieverEvaluator, measuring retriever precision (Hit Rate) and accuracy (MRR) with both standard and perturbed queries, ensuring robust real-world performance assessment.

- **Grammar Testing for NLP Model Evaluation:** This approach entails creating test cases through the paraphrasing of original sentences. The purpose is to evaluate a language model's proficiency in understanding and interpreting the nuanced meaning of the text, enhancing our understanding of its contextual comprehension capabilities.


- **Saving and Loading the Checkpoints:** LangTest now supports the seamless saving and loading of checkpoints, providing users with the ability to manage task progress, recover from interruptions, and ensure data integrity.

- **Extended Support for Medical Datasets:** LangTest adds support for additional medical datasets, including LiveQA, MedicationQA, and HealthSearchQA. These datasets enable a comprehensive evaluation of language models in diverse medical scenarios, covering consumer health, medication-related queries, and closed-domain question-answering tasks.


- **Direct Integration with Hugging Face Models:** Users can effortlessly pass any Hugging Face model object into the LangTest harness and run a variety of tasks. This feature streamlines the process of evaluating and comparing different models, making it easier for users to leverage LangTest's comprehensive suite of tools with the wide array of models available on Hugging Face.


</div><div class="h3-box" markdown="1">

## 🔥 Key Enhancements:

### 🚀Implementing and Evaluating RAG with LlamaIndex and Langtest
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/JohnSnowLabs/langtest/blob/main/demo/tutorials/RAG/RAG_OpenAI.ipynb)

LangTest seamlessly integrates LlamaIndex, focusing on two main aspects: constructing the RAG with LlamaIndex and evaluating its performance. The integration involves utilizing LlamaIndex's generate_question_context_pairs module to create relevant question and context pairs, forming the foundation for retrieval and response evaluation in the RAG system.

To assess the retriever's effectiveness, LangTest introduces LangtestRetrieverEvaluator, which reports two key metrics: Hit Rate and Mean Reciprocal Rank (MRR). Hit Rate gauges precision as the percentage of queries whose correct answer appears in the top-k retrieved documents. MRR gauges accuracy by averaging, across all queries, the reciprocal of the rank of the highest-placed relevant document. Evaluating with both standard queries and perturbed queries generated through LangTest shows how robust the retriever is under varied conditions, which better reflects its real-world performance.

```
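from langtest.evaluation import LangtestRetrieverEvaluator

# `retriever` and `qa_dataset` are assumed to come from the LlamaIndex setup
# described above (qa_dataset from generate_question_context_pairs).
retriever_evaluator = LangtestRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Perturbations applied to the queries during evaluation
retriever_evaluator.setPerturbations("add_typo", "dyslexia_word_swap", "add_ocr_typo")

# Evaluate (top-level `await` works in a notebook cell)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

retriever_evaluator.display_results()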
```
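
To make the two metrics concrete, here is a minimal, self-contained sketch of how Hit Rate and MRR can be computed from ranked retrieval results. It is an illustration only, not LangTest's or LlamaIndex's implementation; the function names `hit_rate` and `mean_reciprocal_rank` and the toy document ids are assumptions made for this example.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for query, docs in retrieved.items() if relevant[query] in docs[:k])
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for query, docs in retrieved.items():
        if relevant[query] in docs:
            total += 1.0 / (docs.index(relevant[query]) + 1)
    return total / len(retrieved)


# Hypothetical toy data: three queries, one relevant document id each.
retrieved = {
    "q1": ["d1", "d7", "d3"],  # relevant document at rank 1
    "q2": ["d9", "d2", "d4"],  # relevant document at rank 2
    "q3": ["d5", "d6", "d8"],  # relevant document not retrieved
}
relevant = {"q1": "d1", "q2": "d2", "q3": "d0"}

hr = hit_rate(retrieved, relevant, k=2)          # 2 of 3 queries hit
mrr = mean_reciprocal_rank(retrieved, relevant)  # (1 + 1/2 + 0) / 3
print(f"hit_rate@2={hr:.3f}, mrr={mrr:.3f}")
```

Running the same computation on perturbed queries and comparing the two sets of scores is one way to read the robustness of a retriever.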

### 📚Grammar Testing in Evaluating and Enhancing NLP Models
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/JohnSnowLabs/langtest/blob/main/demo/tutorials/test-specific-notebooks/Grammar_Demo.ipynb)

Grammar Testing is a key feature in LangTest's suite of evaluation strategies, emphasizing the assessment of a language model's proficiency in contextual understanding and nuance interpretation. By creating test cases that paraphrase original sentences, the goal is to gauge the model's ability to comprehend and interpret text, thereby enriching insights into its contextual mastery.

{:.table3}
| Category | Test Type | Original | Test Case | Expected Result | Actual Result | Pass |
|----------|------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------:|------------------|---------------|-------|
| grammar | paraphrase | This program was on for a brief period when I was a kid, I remember watching it whilst eating fish and chips.<br /><br />Riding on the back of the Tron hype this series was much in the style of streethawk, manimal and the like, except more computery. There was a geeky kid who's computer somehow created this guy - automan. He'd go around solving crimes and the lot.<br /><br />All I really remember was his fancy car and the little flashy cursor thing that used to draw the car and help him out generally.<br /><br />When I mention it to anyone they can remember very little too. Was it real or maybe a dream? | I remember watching a show from my youth that had a Tron theme, with a nerdy kid driving around with a little flashy cursor and solving everyday problems. Was it a genuine story or a mere dream come true? | NEGATIVE | POSITIVE | false |

### 🔥 Saving and Loading the Checkpoints
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Saving_Checkpoints.ipynb)
Introducing a robust checkpointing system in LangTest! The `run` method in the `Harness` class now supports checkpointing, allowing users to save intermediate results, manage batch processing, and specify a directory for storing checkpoints and results. This feature ensures data integrity, providing a mechanism for recovering progress in case of interruptions or task failures.
```
harness.run(checkpoint=True, batch_size=20, save_checkpoints_dir="imdb-checkpoint")
```
The `load_checkpoints` method facilitates the direct loading of saved checkpoints and data, providing a convenient mechanism to resume testing tasks from the point where they were previously interrupted, even in the event of runtime failures or errors.
```
harness = Harness.load_checkpoints(
    save_checkpoints_dir="imdb-checkpoint",
    task="text-classification",
    model={"model": "lvwerra/distilbert-imdb", "hub": "huggingface"},
)
```
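
The idea behind batch checkpointing can be sketched in plain Python. The example below is a simplified illustration under stated assumptions, not how LangTest stores its checkpoints: the helper `run_with_checkpoints`, the `results.json` file name, and the `flaky_upper` stand-in for a model call are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_checkpoints(items: List[str], process: Callable[[str], str],
                         batch_size: int, checkpoint_dir: Path) -> List[str]:
    """Process items in batches and persist the results after every batch."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    state_file = checkpoint_dir / "results.json"
    # Resume from previously saved results, if any exist.
    results = json.loads(state_file.read_text()) if state_file.exists() else []
    for start in range(len(results), len(items), batch_size):
        batch = items[start:start + batch_size]
        # Build the whole batch first so a failure never leaves a partial batch.
        results.extend([process(item) for item in batch])
        state_file.write_text(json.dumps(results))
    return results


calls = {"n": 0}


def flaky_upper(text: str) -> str:
    """Stand-in for a model call that fails once, on the fourth invocation."""
    calls["n"] += 1
    if calls["n"] == 4:
        raise RuntimeError("simulated interruption")
    return text.upper()


items = ["a", "b", "c", "d", "e"]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "demo-checkpoint"
    try:
        run_with_checkpoints(items, flaky_upper, batch_size=2, checkpoint_dir=ckpt)
    except RuntimeError:
        pass  # the first batch was already saved before the failure
    # The second run picks up after the last completed batch.
    resumed = run_with_checkpoints(items, flaky_upper, batch_size=2, checkpoint_dir=ckpt)

print(resumed)
```

Saving once per batch keeps the cost of an interruption bounded by `batch_size` items, which is the trade-off the `batch_size` argument controls.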

### 🏥 Added Support for More Medical Datasets

#### LiveQA
The LiveQA'17 medical task focuses on consumer health question answering. It consists of constructed medical question-answer pairs for training and testing, with additional annotations. LangTest now supports LiveQA for comprehensive medical evaluation.

##### How the dataset looks:

{:.table3}
| category | test_type | original_question | perturbed_question | expected_result | actual_result | eval_score | pass |
|------------|-----------|------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------------------------|------------------------------------------------------------|------------|------|
| robustness | uppercase | Do amphetamine salts 20mg tablets contain gluten? | DO AMPHETAMINE SALTS 20MG TABLETS CONTAIN GLUTEN? | No, amphetamine salts 20mg tablets do not contain gluten. | No, Amphetamine Salts 20mg Tablets do not contain gluten. | 1.0 | true |
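
The `uppercase` test type in the row above can be illustrated with a short sketch. This is an assumption about what the perturbation amounts to (upper-casing the question text), written as a hypothetical helper rather than LangTest's own code.

```python
def uppercase_perturbation(question: str) -> str:
    """Return the question with every character upper-cased."""
    return question.upper()


original = "Do amphetamine salts 20mg tablets contain gluten?"
perturbed = uppercase_perturbation(original)

# The perturbation changes only letter case, so the content is unchanged.
same_content = perturbed.lower() == original.lower()
print(perturbed)
```

A robust model is expected to give an equivalent answer for both forms of the question, which is what `eval_score` and `pass` summarize in the table.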

#### MedicationQA
The MedicationQA dataset consists of commonly asked consumer questions about medications. It includes annotations corresponding to drug focus and interactions. LangTest now integrates MedicationQA for thorough evaluation of models in medication-related scenarios.

##### How the dataset looks:

{:.table3}
| category | test_type | original_question | perturbed_question | expected_result | actual_result | eval_score | pass |
|------------|-----------|-----------------------------------------------------|--------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|------|
| robustness | uppercase | how does rivatigmine and otc sleep medicine interact | HOW DOES RIVATIGMINE AND OTC SLEEP MEDICINE INTERACT | Rivastigmine is a cholinesterase inhibitor and OTC (over-the-counter) sleep medicine is a sedative. There is a risk of excessive sedation when taking both of these medications together. Patients should consult their doctor before taking both of these medications together. | Rivastigmine and OTC sleep medicines may interact. Rivastigmine can increase the effects of sedatives, including OTC sleep medicines, resulting in increased drowsiness. It's important to talk to your doctor before taking OTC sleep medicines while taking Rivastigmine. | 0.9365371414708803 | true |


#### HealthSearchQA
HealthSearchQA is a new free-response dataset of medical questions sought online. It complements six existing open question-answering datasets covering professional medical exams, research, and consumer queries. The HealthSearchQA dataset consists of 3375 frequently asked consumer questions and was curated using seed medical diagnoses and their related symptoms.

##### How the dataset looks:

{:.table3}
| category | test_type | original_question | perturbed_question | expected_result | actual_result | eval_score | pass |
|------------|-----------|------------------------------------------|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|---------------------|------|
| robustness | uppercase | Are dental abscesses serious? | ARE DENTAL ABSCESSES SERIOUS? | Yes, dental abscesses are serious and can lead to infection and damage to the surrounding tissue if left untreated. Treatment typically involves antibiotics and/or draining the abscess. If left untreated, the infection can spread to other parts of the body. | Dental abscesses can be serious and require prompt medical attention. Left untreated, they can cause swelling, spreading infections, and damage to the surrounding teeth and bone. | 0.9457038739103363 | true |



### 🚀Direct Integration with Hugging Face Models

Users can effortlessly pass any Hugging Face model object into the LangTest harness and run a variety of tasks. This feature streamlines the process of evaluating and comparing different models, making it easier for users to leverage LangTest's comprehensive suite of tools with the wide array of models available on Hugging Face.

![image](https://github.com/JohnSnowLabs/langtest/assets/71844877/adef09b7-e33d-42ec-86f3-a96dea85387e)


## 🚀 New LangTest Blogs:

{:.table2}
| Blog | Description |
| --- | --- |
| [LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models](https://www.johnsnowlabs.com/langtest-a-secret-weapon-for-improving-the-robustness-of-your-transformers-language-models/) | Explore the robustness of Transformers Language Models with LangTest Insights. |
| [Testing the Robustness of LSTM-Based Sentiment Analysis Models](https://medium.com/john-snow-labs/testing-the-robustness-of-lstm-based-sentiment-analysis-models-67ed84e42997) | Explore the robustness of custom models with LangTest Insights. |

## 🐛 Bug Fixes

- Fixed LangTestCallback errors
- Fixed QA, Default Config, and Transformer Model for QA
- Fixed multi-model evaluation
- Fixed datasets format

## ⚒️ Previous Versions

</div>
{%- include docs-langtest-pagination.html -%}