Merged
152 commits
0ebe067
remove NB
ArshaanNazir Nov 16, 2023
4ddc8d0
updated langtest for saving generated_results
Prikshit7766 Nov 16, 2023
5a81983
updated test_harness.py
Prikshit7766 Nov 16, 2023
a650ac2
langtest.py: Get qaevalchain from config
RakshitKhajuria Nov 17, 2023
c68b21e
updated sample.py
Prikshit7766 Nov 17, 2023
00716aa
Merge branch 'benchmark-task' of https://github.com/JohnSnowLabs/lang…
RakshitKhajuria Nov 17, 2023
3078d00
black format
RakshitKhajuria Nov 17, 2023
7739ff3
minor fix
Prikshit7766 Nov 17, 2023
f60d534
fix lint and doc update
RakshitKhajuria Nov 17, 2023
8740862
formatted test_harness.py
Prikshit7766 Nov 17, 2023
e7b8c98
updated sample.py
Prikshit7766 Nov 17, 2023
6bf4b83
update website
ArshaanNazir Nov 17, 2023
9cf33d3
Rename GLOBAL_MODEL --> EVAL_MODEL
RakshitKhajuria Nov 17, 2023
c739d24
Merge branch 'fix/evalchain' of https://github.com/JohnSnowLabs/langt…
RakshitKhajuria Nov 17, 2023
e37cee0
QAEvalChain eval_model fix
Prikshit7766 Nov 18, 2023
2206afa
Add predict_raw method to PretrainedCustomModel
chakravarthik27 Nov 20, 2023
26c44cf
remove QAEval chain reference
ArshaanNazir Nov 20, 2023
c0606c9
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
ArshaanNazir Nov 20, 2023
b99b759
Fix evaluation logic in QASample class
chakravarthik27 Nov 21, 2023
927a45a
update sample
ArshaanNazir Nov 21, 2023
bad29df
fix linting
ArshaanNazir Nov 21, 2023
739f2bd
remove unnecassary print
ArshaanNazir Nov 21, 2023
421e904
added college_biology_test high_school_biology_test test set
RakshitKhajuria Nov 21, 2023
bf96871
added to path
RakshitKhajuria Nov 21, 2023
fb512b6
added college_biology_test high_school_biology_test test set
RakshitKhajuria Nov 21, 2023
fad52a2
added medicine genetics knowledge dataset
Prikshit7766 Nov 21, 2023
58791b7
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
Prikshit7766 Nov 21, 2023
292cbe6
fix typo
Prikshit7766 Nov 21, 2023
baa99c7
added 3 datsets
RakshitKhajuria Nov 21, 2023
5eb8c04
added dental ent and foremsic
RakshitKhajuria Nov 21, 2023
cf84140
added gymo, med, microbio, opht
RakshitKhajuria Nov 21, 2023
a2d6d15
added to path
RakshitKhajuria Nov 21, 2023
5a4cb8a
minor fix
RakshitKhajuria Nov 21, 2023
261b713
added pathology, pediatrics, pharmacology dataset
Prikshit7766 Nov 21, 2023
5daa80d
added physiology psychiatry, radiology dataset
Prikshit7766 Nov 21, 2023
ee83632
added Skin, Social_Preventive_Medicine, Surgery, Unknown dataset
Prikshit7766 Nov 21, 2023
46066f7
added to setup
RakshitKhajuria Nov 21, 2023
8675e59
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
RakshitKhajuria Nov 21, 2023
3f01be1
added default prompt
Prikshit7766 Nov 21, 2023
afde918
updated datasource
Prikshit7766 Nov 21, 2023
1171f86
Merge branch 'fix/evalchain' of https://github.com/JohnSnowLabs/langt…
Prikshit7766 Nov 21, 2023
4818631
Merge branch 'fix/evalchain' of https://github.com/JohnSnowLabs/langt…
RakshitKhajuria Nov 21, 2023
1d19b45
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
RakshitKhajuria Nov 21, 2023
3e02595
updated sample.py
Prikshit7766 Nov 22, 2023
e4f12c6
added medqa
RakshitKhajuria Nov 22, 2023
b027694
added medqa to path
RakshitKhajuria Nov 22, 2023
8959c54
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
RakshitKhajuria Nov 22, 2023
10334e4
add MedMCQ helper prompt
ArshaanNazir Nov 22, 2023
3bb5098
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
ArshaanNazir Nov 22, 2023
7ee596b
update prompt
ArshaanNazir Nov 22, 2023
bc11a92
update medqa prompt
ArshaanNazir Nov 23, 2023
e5c2b36
update sample
ArshaanNazir Nov 23, 2023
56d328a
fix linting
ArshaanNazir Nov 23, 2023
c76afa9
add callback class
alytarik Nov 23, 2023
ac62396
create callback notebook
alytarik Nov 23, 2023
29f282f
update notebook
alytarik Nov 23, 2023
55c2c8a
updated langtest.py to load testcases
Prikshit7766 Nov 23, 2023
326d1c2
error.py: added warning
Prikshit7766 Nov 23, 2023
c79cf3c
added pubmed qa subsets
RakshitKhajuria Nov 23, 2023
0a77a4b
added to path
RakshitKhajuria Nov 23, 2023
806efa3
added MedMCQ-Validation
Prikshit7766 Nov 24, 2023
7e86c74
added path in datasource
Prikshit7766 Nov 24, 2023
11bba1e
update is_pass
ArshaanNazir Nov 24, 2023
279a618
added default prompt and updated sample.py
Prikshit7766 Nov 24, 2023
12d460b
Merge branch 'mmlu-clinical' of https://github.com/JohnSnowLabs/langt…
Prikshit7766 Nov 24, 2023
ea63968
add callback class information
alytarik Nov 24, 2023
665981d
Add support for ChatOpenAI model in
chakravarthik27 Nov 24, 2023
3f930a2
Add support for gpt-4-turbo model
chakravarthik27 Nov 24, 2023
22bf053
add option to generate sample templates
ArshaanNazir Nov 27, 2023
2fa2f36
update prompt
ArshaanNazir Nov 27, 2023
eeb27d9
update templatic augmentation
ArshaanNazir Nov 27, 2023
d9a15fc
updating generate_templates functionality
ArshaanNazir Nov 27, 2023
a9fddbd
fix linting
ArshaanNazir Nov 27, 2023
5abaf46
Update model path in PretrainedModelForQA class
chakravarthik27 Nov 28, 2023
d2521f6
update error
ArshaanNazir Nov 28, 2023
8598407
add new error code
ArshaanNazir Nov 28, 2023
c457157
fix linting
ArshaanNazir Nov 28, 2023
d4b9bd5
Merge pull request #898 from JohnSnowLabs/Fix/pretrainedcustommodel-l…
ArshaanNazir Nov 28, 2023
6adc444
Merge pull request #901 from JohnSnowLabs/fix/templatic_augmentation
ArshaanNazir Nov 28, 2023
a0abafb
added benchmark table for openbook
RakshitKhajuria Nov 28, 2023
3677949
updated mmlu.md
Prikshit7766 Nov 28, 2023
71b8c75
Fix evaluation logic for sample classes
chakravarthik27 Nov 29, 2023
a7456a8
rename MedMCQ -> MedMCQA
Prikshit7766 Nov 29, 2023
434e185
updated sample.py
Prikshit7766 Nov 29, 2023
555840c
add helper prompt for pubmedqa
ArshaanNazir Nov 29, 2023
4969a36
update QA sample and data files
ArshaanNazir Nov 29, 2023
0c28cc4
update data files for Pubmed
ArshaanNazir Nov 29, 2023
42b9557
Merge pull request #896 from JohnSnowLabs/fix/evalchain
ArshaanNazir Nov 29, 2023
6b52f68
Merge pull request #900 from JohnSnowLabs/mmlu-clinical
ArshaanNazir Nov 29, 2023
e44e7f0
use display instead of print
alytarik Nov 29, 2023
848c47f
updated navigation.yml
Prikshit7766 Nov 29, 2023
28d4dc0
added reports
RakshitKhajuria Nov 29, 2023
563e42b
added medmcqa
RakshitKhajuria Nov 29, 2023
57918dd
cropped image
RakshitKhajuria Nov 29, 2023
3e3c601
Fix formatting in errors.py
chakravarthik27 Nov 29, 2023
00dd03a
Merge branch 'release/1.9.0' of https://github.com/JohnSnowLabs/langt…
Prikshit7766 Nov 29, 2023
7ac2cc4
added pubmed
RakshitKhajuria Nov 29, 2023
81c9a74
Merge pull request #903 from JohnSnowLabs/feature/hf-callback
ArshaanNazir Nov 29, 2023
92ce1e3
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
Prikshit7766 Nov 29, 2023
e5225c0
minor fix
Prikshit7766 Nov 29, 2023
605b2a4
added medqa.md
Prikshit7766 Nov 29, 2023
092c2cf
added datasets to table - no links added
RakshitKhajuria Nov 29, 2023
f67ce5f
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
RakshitKhajuria Nov 29, 2023
692c627
updated images links
Prikshit7766 Nov 29, 2023
4ca8b6c
update LangTestCallback
ArshaanNazir Nov 29, 2023
76ddb12
Merge pull request #905 from JohnSnowLabs/feature/hf-callback
ArshaanNazir Nov 29, 2023
8b7465f
added OpenbookQA_benchmarks notebook
Prikshit7766 Nov 29, 2023
6df2799
Merge branches 'fix/website' and 'fix/website' of https://github.com/…
RakshitKhajuria Nov 30, 2023
a5465ce
Fix evaluation logic for is_pass method
chakravarthik27 Nov 30, 2023
3090d6b
added benchmark
RakshitKhajuria Nov 30, 2023
2204a9a
added test tiny version of medqa
RakshitKhajuria Nov 30, 2023
f15b423
added benchmark table
Prikshit7766 Nov 30, 2023
5fc0dc1
minor fix
Prikshit7766 Nov 30, 2023
02f3a7f
updated setup.py
Prikshit7766 Nov 30, 2023
fd6d6ec
update notebooks
alytarik Nov 30, 2023
743577d
added medqa report image
RakshitKhajuria Nov 30, 2023
71c1a4a
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
RakshitKhajuria Nov 30, 2023
82ee42e
updated medqa.md
Prikshit7766 Nov 30, 2023
85dd7e3
add parameter to show generated samples
ArshaanNazir Nov 30, 2023
e0543f0
fix linting
ArshaanNazir Nov 30, 2023
f30fe79
add docstrings
ArshaanNazir Nov 30, 2023
bca83b6
add website page for langtestcallback
alytarik Nov 30, 2023
0cbe241
add langtestcallback to navigation
alytarik Nov 30, 2023
a736274
update notebooks
alytarik Nov 30, 2023
5c4faf8
Update transformers version to 4.34.1
chakravarthik27 Nov 30, 2023
d1b4ee0
Update transformers version in pyproject.toml
chakravarthik27 Nov 30, 2023
b536495
Update transformers version to 4.35
chakravarthik27 Nov 30, 2023
199f4e9
Merge pull request #906 from JohnSnowLabs/fix/duplicate-runs-in-harness
ArshaanNazir Nov 30, 2023
c9bdeea
Merge pull request #908 from JohnSnowLabs/fix/templatic_augmentation
ArshaanNazir Nov 30, 2023
abbb729
categorization of benchmark datasets
Prikshit7766 Nov 30, 2023
1d4df7b
updated link in Benchmark_Dataset_Notebook
Prikshit7766 Nov 30, 2023
d6f79e0
update templatic augmentaation NB
ArshaanNazir Nov 30, 2023
97c17c2
Merge pull request #907 from JohnSnowLabs/docs/callback-nb
ArshaanNazir Nov 30, 2023
d5b08c1
Add blog link to readme
ArshaanNazir Nov 30, 2023
9aa064b
add langtestcallback notebooks to misc notebooks page
alytarik Nov 30, 2023
bdfcef5
updated navigation.yml
Prikshit7766 Nov 30, 2023
63cf0d0
fix benchmark pages
Prikshit7766 Nov 30, 2023
fc2f195
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
Prikshit7766 Nov 30, 2023
9b29425
update HF callback NB
ArshaanNazir Nov 30, 2023
0372103
Merge branch 'release/1.9.0' into fix/website
alytarik Nov 30, 2023
6d86bdd
add !wget <file> for testing files in notebooks
alytarik Nov 30, 2023
a817d08
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
alytarik Nov 30, 2023
32dca6f
fix SEO titles
ArshaanNazir Nov 30, 2023
6c76867
minor fix
Prikshit7766 Nov 30, 2023
18ffeca
Merge branch 'fix/website' of https://github.com/JohnSnowLabs/langtes…
Prikshit7766 Nov 30, 2023
471aa25
update pqal.jsonl with correct format
ArshaanNazir Dec 1, 2023
a05eccc
Add medical datasets NB
ArshaanNazir Dec 1, 2023
7d2bbd9
update notebook links on website
ArshaanNazir Dec 1, 2023
3546be0
Merge pull request #899 from JohnSnowLabs/fix/website
ArshaanNazir Dec 1, 2023
7e211cb
update README with new blogs
ArshaanNazir Dec 1, 2023
a8e9b0b
update library version
ArshaanNazir Dec 1, 2023
b5f7b32
update README
ArshaanNazir Dec 1, 2023
11 changes: 10 additions & 1 deletion README.md
@@ -101,7 +101,16 @@ You can check out the following LangTest articles:
| [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https://medium.com/john-snow-labs/evaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | In this blog post, we dive into testing the WinoBias dataset on LLMs, examining language models’ handling of gender and occupational roles, evaluation metrics, and the wider implications. Let’s explore the evaluation of language models with LangTest on the WinoBias dataset and confront the challenges of addressing bias in AI. |
| [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https://medium.com/john-snow-labs/streamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models. Enter MLFlow and LangTest: two tools that, when combined, create a revolutionary approach to ML development. |
| [**Testing the Question Answering Capabilities of Large Language Models**](https://medium.com/john-snow-labs/testing-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | In this blog post, we dive into enhancing the QA evaluation capabilities using LangTest library. Explore about different evaluation methods that LangTest offers to address the complexities of evaluating Question Answering (QA) tasks. |
| [**Evaluating Stereotype Bias with LangTest**](To be published soon) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
| [**Evaluating Stereotype Bias with LangTest**](https://medium.com/john-snow-labs/evaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
| [**Unveiling Sentiments: Exploring LSTM-based Sentiment Analysis with PyTorch on the IMDB Dataset**](To be Published) | Explore the robustness of custom models with LangTest Insights.|
| [**LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA**](To be Published) | Explore the robustness of Language Models (LLMs) on the OpenBookQA dataset with LangTest Insights.|
| [**LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models**](To be Published) | Explore the robustness of Transformers Language Models with LangTest Insights.|








> **Note**
22,506 changes: 22,506 additions & 0 deletions demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb

Large diffs are not rendered by default.

5,255 changes: 5,255 additions & 0 deletions demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb

Large diffs are not rendered by default.


668 changes: 668 additions & 0 deletions demo/tutorials/misc/HF_Callback_NER.ipynb

Large diffs are not rendered by default.

6,207 changes: 6,207 additions & 0 deletions demo/tutorials/misc/HF_Callback_Text_Classification.ipynb

Large diffs are not rendered by default.

2,655 changes: 2,654 additions & 1 deletion demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb

Large diffs are not rendered by default.

93 changes: 54 additions & 39 deletions docs/_data/navigation.yml
@@ -56,6 +56,8 @@ docs-menu:
url: /docs/pages/docs/report
- title: MlFlow Tracking
url: /docs/pages/docs/ml_flow
- title: LangTestCallback
url: /docs/pages/docs/hf-callback

- title: Saving & Loading
url: /docs/pages/docs/save
@@ -135,10 +137,6 @@ tutorials:
url: /docs/pages/tutorials/LLM_testing_Notebooks/sycophancy
- title: Stereotype
url: /docs/pages/tutorials/LLM_testing_Notebooks/stereotype
- title: Benchmark Dataset Notebooks
url: /docs/pages/tutorials/Benchmark_Dataset_Notebook_Notebooks
- title: End-to-End Workflow Notebooks
url: /docs/pages/tutorials/End_to_End_workflow_Notebooks
- title: Miscellaneous Notebooks
url: /docs/pages/tutorials/Miscellaneous_Notebooks
children:
@@ -150,6 +148,10 @@ tutorials:
url: /docs/pages/tutorials/misc/different_report_formats
- title: Editing Testcases
url: /docs/pages/tutorials/misc/editing-testcases
- title: Benchmark Dataset Notebooks
url: /docs/pages/tutorials/Benchmark_Dataset_Notebook_Notebooks
- title: End-to-End Workflow Notebooks
url: /docs/pages/tutorials/End_to_End_workflow_Notebooks

tests:
- title: Tests
@@ -190,50 +192,63 @@ tests:
url: /docs/pages/tests/ideology

benchmarks:
- title: Benchmarks
url: /docs/pages/benchmarks/benchmark
- title: Medical
url: /docs/pages/benchmarks/medical
children:
- title: MedMCQA
url: /docs/pages/benchmarks/medical/medmcqa
- title: MedQA
url: /docs/pages/benchmarks/medical/medqa
- title: PubMedQA
url: /docs/pages/benchmarks/medical/pubmedqa
- title: Commonsense Scenario
url: /docs/pages/benchmarks/commonsense_scenario
children:
- title: ASDiv
url: /docs/pages/benchmarks/asdiv
- title: BBQ
url: /docs/pages/benchmarks/bbq
- title: Bigbench
url: /docs/pages/benchmarks/bigbench
- title: BoolQ
url: /docs/pages/benchmarks/boolq
- title: CommonsenseQA
url: /docs/pages/benchmarks/commonsenseqa
- title: FIQA
url: /docs/pages/benchmarks/fiqa
url: /docs/pages/benchmarks/commonsense_scenario/commonsenseqa
- title: HellaSwag
url: /docs/pages/benchmarks/hellaswag
- title: Consumer-Contracts
url: /docs/pages/benchmarks/consumer-contracts
url: /docs/pages/benchmarks/commonsense_scenario/hellaswag
- title: OpenBookQA
url: /docs/pages/benchmarks/commonsense_scenario/openbookqa
- title: PIQA
url: /docs/pages/benchmarks/commonsense_scenario/piqa
- title: SIQA
url: /docs/pages/benchmarks/commonsense_scenario/siqa
- title : Legal
url: /docs/pages/benchmarks/legal
children:
- title: Contracts
url: /docs/pages/benchmarks/contracts
url: /docs/pages/benchmarks/legal/contracts
- title: Consumer-Contracts
url: /docs/pages/benchmarks/legal/consumer-contracts
- title: Privacy-Policy
url: /docs/pages/benchmarks/privacy-policy
url: /docs/pages/benchmarks/legal/privacy-policy
- title: FIQA
url: /docs/pages/benchmarks/legal/fiqa
- title: MultiLexSum
url: /docs/pages/benchmarks/legal/multilexsum
- title: Other Benchmarks
url: /docs/pages/benchmarks/other_benchmarks
children:
- title: ASDiv
url: /docs/pages/benchmarks/other_benchmarks/asdiv
- title: BBQ
url: /docs/pages/benchmarks/other_benchmarks/bbq
- title: Bigbench
url: /docs/pages/benchmarks/other_benchmarks/bigbench
- title: BoolQ
url: /docs/pages/benchmarks/other_benchmarks/boolq
- title: LogiQA
url: /docs/pages/benchmarks/logiqa
url: /docs/pages/benchmarks/other_benchmark/logiqa
- title: MMLU
url: /docs/pages/benchmarks/mmlu
- title: MultiLexSum
url: /docs/pages/benchmarks/multilexsum
url: /docs/pages/benchmarks/other_benchmarks/mmlu
- title: NarrativeQA
url: /docs/pages/benchmarks/narrativeqa
url: /docs/pages/benchmarks/other_benchmarks/narrativeqa
- title: NQ-open
url: /docs/pages/benchmarks/nq-open
- title: OpenBookQA
url: /docs/pages/benchmarks/openbookqa
- title: PIQA
url: /docs/pages/benchmarks/piqa
url: /docs/pages/benchmarks/other_benchmarks/nq-open
- title: Quac
url: /docs/pages/benchmarks/quac
- title: SIQA
url: /docs/pages/benchmarks/siqa
url: /docs/pages/benchmarks/other_benchmarks/quac
- title: TruthfulQA
url: /docs/pages/benchmarks/truthfulqa
url: /docs/pages/benchmarks/other_benchmarks/truthfulqa
- title: XSum
url: /docs/pages/benchmarks/xsum


url: /docs/pages/benchmarks/other_benchmarks/xsum
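
The navigation change above nests each benchmark page under its category URL (for example `/docs/pages/benchmarks/legal/contracts` under `/docs/pages/benchmarks/legal`). A minimal sketch of a consistency check for that convention follows. It is not part of this PR: the `find_mismatched_children` helper and the `BENCHMARKS_MENU` excerpt are hypothetical, the excerpt copies only a few entries as they appear in the diff, and whether the docs site actually requires child URLs to share the parent prefix is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of the "benchmarks" menu, written as plain Python data
# so the sketch needs no YAML parser. URLs are copied from the diff above.
BENCHMARKS_MENU: List[Dict] = [
    {
        "title": "Other Benchmarks",
        "url": "/docs/pages/benchmarks/other_benchmarks",
        "children": [
            {"title": "BoolQ", "url": "/docs/pages/benchmarks/other_benchmarks/boolq"},
            {"title": "LogiQA", "url": "/docs/pages/benchmarks/other_benchmark/logiqa"},
            {"title": "MMLU", "url": "/docs/pages/benchmarks/other_benchmarks/mmlu"},
        ],
    },
]


def find_mismatched_children(menu: List[Dict]) -> List[Tuple[str, str, str]]:
    """Return (parent title, child title, child url) for every child whose
    URL does not start with its parent's URL followed by a slash."""
    mismatches = []
    for entry in menu:
        prefix = entry["url"].rstrip("/") + "/"
        for child in entry.get("children", []):
            if not child["url"].startswith(prefix):
                mismatches.append((entry["title"], child["title"], child["url"]))
    return mismatches


if __name__ == "__main__":
    for parent, child, url in find_mismatched_children(BENCHMARKS_MENU):
        print(f"{parent} -> {child}: {url}")
```

Run against this excerpt, the check reports the LogiQA entry, whose path uses `other_benchmark` rather than `other_benchmarks`. That may be worth a second look, although the page may still resolve if the site does not depend on the prefix.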
Binary file added docs/assets/images/benchmarks/medmcq.png
Binary file added docs/assets/images/benchmarks/medqa.png
Binary file added docs/assets/images/benchmarks/mmlu.png
Binary file added docs/assets/images/benchmarks/openbookqa.png