Merged
118 commits
c2b1393
Refactor NERSample.is_pass() to handle cases where either aligned spa…
chakravarthik27 Jul 25, 2024
ea48e1f
format issues
chakravarthik27 Jul 25, 2024
beb0d31
Merge pull request #1080 from JohnSnowLabs/bug/nersample-transformati…
chakravarthik27 Jul 25, 2024
52b81f2
resolved: recovering the transformation object.
chakravarthik27 Aug 5, 2024
1c0112e
removed the unused imports
chakravarthik27 Aug 5, 2024
3229bd2
chore: Recover transformation object and apply to NER task test cases
chakravarthik27 Aug 5, 2024
e42f282
resolved: unknown args
chakravarthik27 Aug 5, 2024
f195d1f
chore: Refactor CSVDataset to handle missing or invalid transformations
chakravarthik27 Aug 5, 2024
d155f20
Merge pull request #1081 from JohnSnowLabs/bug/nersample-transformati…
chakravarthik27 Aug 5, 2024
82f8b87
fixed: consistency issues while generating templates in templatic augme…
chakravarthik27 Aug 13, 2024
a47156d
resolved: lint and format issues.
chakravarthik27 Aug 13, 2024
92cd12c
fixed: transformed and add export types are supported in DataAugmenter
chakravarthik27 Aug 14, 2024
712d4d6
fixed: inplace method in DataAugmenter with proper proportion.
chakravarthik27 Aug 14, 2024
fd2333d
update doc strings and remove the print statements.
chakravarthik27 Aug 14, 2024
fa7e3cf
Merge pull request #1085 from JohnSnowLabs/fix/augmentation-config-va…
chakravarthik27 Aug 14, 2024
526ae6f
chore: generate additional templates in TemplaticAugment as user choi…
chakravarthik27 Aug 15, 2024
8185a90
chore: remove quotes in generated template and self check the num_ext…
chakravarthik27 Aug 15, 2024
f377ff0
Merge pull request #1089 from JohnSnowLabs/fix/augmentation-config-va…
chakravarthik27 Aug 15, 2024
6734f70
chore: Fix error message in Augmentation when generating templates
chakravarthik27 Aug 16, 2024
55d17e1
chore: Refactor DataAugmenter to improve template generation and prop…
chakravarthik27 Aug 16, 2024
b7f68c1
Refactor DataAugmenter to improve proportion handling
chakravarthik27 Aug 16, 2024
64475eb
Merge pull request #1090 from JohnSnowLabs/fix/augmentation-config-va…
chakravarthik27 Aug 16, 2024
24be4be
Refactor TemplaticAugment to support multiple AI providers for templa…
chakravarthik27 Aug 26, 2024
4d866f2
Integrated Azure OpenAI and OpenAI services for automated template ge…
chakravarthik27 Aug 27, 2024
d04d500
added comment for "azoi means Azure OpenAI"
chakravarthik27 Aug 27, 2024
29d136e
updated the model_config handling.
chakravarthik27 Aug 28, 2024
cccb562
changed: logging to logger from langtest
chakravarthik27 Aug 29, 2024
85d7e70
added: doc lines
chakravarthik27 Sep 2, 2024
cc821c9
Merge pull request #1091 from JohnSnowLabs/fix/augmentation-config-va…
chakravarthik27 Sep 2, 2024
6887233
implemented: basic version to handling document wise.
chakravarthik27 Sep 2, 2024
df7776e
implemented: text-classification support for multi-label classification.
chakravarthik27 Sep 3, 2024
2da96b7
Refactor SequenceClassificationOutputFormatter to handle multi-label …
chakravarthik27 Sep 3, 2024
16fee46
Refactor CSVDataset to remove unnecessary transformation field
chakravarthik27 Sep 3, 2024
beec9c3
feat: Add pos_tag and chunk_tag to ConllDataset token creation in doc…
chakravarthik27 Sep 3, 2024
258a0f7
fixed: Unbound Error and Key Error.
chakravarthik27 Sep 3, 2024
23eb0c3
Merge pull request #1096 from JohnSnowLabs/feature/add-support-for-th…
chakravarthik27 Sep 3, 2024
f2f3cc0
Merge pull request #1097 from JohnSnowLabs/patch/2.3.1
ArshaanNazir Sep 4, 2024
72eba73
chore: update pyproject.toml version to 2.3.1
chakravarthik27 Sep 4, 2024
5b1c284
Merge pull request #1098 from JohnSnowLabs/patch/2.3.1
chakravarthik27 Sep 4, 2024
dd588cb
chore: update DataAugmenter to support generating JSON output for NER…
chakravarthik27 Sep 9, 2024
12934cd
Refactor: the save method in the DataAugmenter class to handle the fi…
chakravarthik27 Sep 9, 2024
97bee72
Merge pull request #1100 from JohnSnowLabs/feature/add-json-output-fo…
chakravarthik27 Sep 9, 2024
78cb31f
Merge pull request #1101 from JohnSnowLabs/patch/2.3.1
chakravarthik27 Sep 9, 2024
2eb72a3
Refactor ConllDataset token creation in doc_wise
chakravarthik27 Sep 10, 2024
aef1352
Refactor NEROutputFormatter to add newline character after each sentence
chakravarthik27 Sep 10, 2024
6a8aae3
fixed: linting issues
chakravarthik27 Sep 10, 2024
3111017
Refactor NEROutputFormatter to handle newline characters in sample pr…
chakravarthik27 Sep 10, 2024
6759417
fixed: issue with `doc_wise` parameter for another task.
chakravarthik27 Sep 10, 2024
84040b6
fixed: doc_wise issue in harness import_testcases method.
chakravarthik27 Sep 11, 2024
3927b24
Merge remote-tracking branch 'origin/patch/2.3.1' into enhance/docume…
chakravarthik27 Sep 11, 2024
5598359
fixed: module error while importing harness.
chakravarthik27 Sep 11, 2024
e312032
updated: build and test support for patch branches
chakravarthik27 Sep 11, 2024
f1fbdc1
Merge pull request #1094 from JohnSnowLabs/enhance/document-wise-data…
chakravarthik27 Sep 11, 2024
4448c71
Merge pull request #1102 from JohnSnowLabs/fix/module_error-with-open…
chakravarthik27 Sep 11, 2024
134de82
Refactor split method in robustness.py to split on space character ex…
chakravarthik27 Sep 11, 2024
b35c28a
Merge pull request #1103 from JohnSnowLabs/patch/2.3.1
chakravarthik27 Sep 11, 2024
f331b69
Added: implemented breaking sentences by newline in robustness.
chakravarthik27 Sep 14, 2024
f274765
refactor the add_new_lines and while random selection of number of ne…
chakravarthik27 Sep 14, 2024
0414f71
parameter: number_of_lines -> max_lines.
chakravarthik27 Sep 14, 2024
a3986b4
Merge pull request #1109 from JohnSnowLabs/feature/implement-the-addn…
chakravarthik27 Sep 14, 2024
3160b1d
Implemented the add_tabs test in robustness category
chakravarthik27 Sep 14, 2024
8179145
Merge remote-tracking branch 'origin/release/2.4.0' into feature/impl…
chakravarthik27 Sep 14, 2024
c8a9511
implemented: basic structured to handle visualQA
chakravarthik27 Sep 14, 2024
f7b53e6
Refactor VisualQASample class to include additional attributes and do…
chakravarthik27 Sep 14, 2024
6eec7ca
Refactor llm_modelhandler.py to include PretrainedModelForVisualQA class
chakravarthik27 Sep 14, 2024
b95ecf3
Refactor VisualQA class to fix typo in base class name
chakravarthik27 Sep 14, 2024
ca2f9d6
Merge pull request #1110 from JohnSnowLabs/feature/implement-the-addt…
chakravarthik27 Sep 15, 2024
adf18db
Merge remote-tracking branch 'origin/release/2.4.0' into feature/impl…
chakravarthik27 Sep 15, 2024
d3e6fa5
updated: image handling while loading dataset.
chakravarthik27 Sep 15, 2024
3ee5f8f
implemented the different tests under robustness category and support…
chakravarthik27 Sep 15, 2024
3dd6770
Refactor image handling in robustness tests
chakravarthik27 Sep 15, 2024
d95e558
Refactor image handling in robustness tests and add support for multi…
chakravarthik27 Sep 15, 2024
ebd7bfd
Refactor image handling in robustness tests and update VisualQASample…
chakravarthik27 Sep 15, 2024
4538490
Refactor image handling in robustness tests and exclude image-related…
chakravarthik27 Sep 15, 2024
41f0db2
fixed: format issues.
chakravarthik27 Sep 15, 2024
3521927
Refactor image handling in robustness tests and remove commented code
chakravarthik27 Sep 16, 2024
a87e96c
Refactor image handling in robustness tests and update VisualQASample…
chakravarthik27 Sep 16, 2024
04e18e3
- added new tests in image robustness.
chakravarthik27 Sep 16, 2024
8039ef8
Add pillow library to pyproject.toml
chakravarthik27 Sep 16, 2024
febf855
Update transformers version to 4.44.2
chakravarthik27 Sep 16, 2024
101305a
Update transformers version to 4.43.1
chakravarthik27 Sep 16, 2024
96cc4f1
Update pyproject.toml to force CPU installation of torch
chakravarthik27 Sep 16, 2024
d64312d
Update accelerate version to 0.22.0
chakravarthik27 Sep 16, 2024
4780cf0
Update accelerate version to 0.33.0 and pyproject.toml to force CPU i…
chakravarthik27 Sep 16, 2024
0c7c9b0
Now handles the multi-label in accuracy tests.
chakravarthik27 Sep 16, 2024
54f235d
Refactor accuracy tests to handle multi-label classification
chakravarthik27 Sep 16, 2024
a04eba6
Update mlflow version to 2.16.1 and add openpyxl and tables dependencies
chakravarthik27 Sep 16, 2024
9f7f73e
Merge pull request #1114 from JohnSnowLabs/fix/error-in-accuracy-test…
chakravarthik27 Sep 16, 2024
ac652cf
Update pydantic version to 1.10.11
chakravarthik27 Sep 17, 2024
2d0f0d8
Update transformers version to 4.44.2 and mlflow version to 2.16.2
chakravarthik27 Sep 17, 2024
3745e6a
Refactor calculate_f1_score function to handle different types of y_t…
chakravarthik27 Sep 17, 2024
bcdfc92
formatted.
chakravarthik27 Sep 17, 2024
b0a1a26
Merge pull request #1116 from JohnSnowLabs/fix/error-in-accuracy-test…
chakravarthik27 Sep 17, 2024
d3a4663
Merge pull request #1112 from JohnSnowLabs/update/fixing-security-issues
chakravarthik27 Sep 17, 2024
a5ae26a
Merge remote-tracking branch 'origin/release/2.4.0' into feature/impl…
chakravarthik27 Sep 17, 2024
10aa4b3
Refactor security.py to add new security checks
chakravarthik27 Sep 17, 2024
b29f9dd
resolve OutofMemory issues
chakravarthik27 Sep 17, 2024
16a3aa5
updated the notebook
chakravarthik27 Sep 17, 2024
b337d2b
Update pillow version to 10.0.0 and make it a required dependency
chakravarthik27 Sep 17, 2024
67c641d
Merge pull request #1111 from JohnSnowLabs/feature/implement-the-supp…
chakravarthik27 Sep 17, 2024
62b77b1
Refactor typing imports in accuracy.py and safety.py
chakravarthik27 Sep 18, 2024
409cb96
Refactor prepare_model_response method to handle multi-label classifi…
chakravarthik27 Sep 18, 2024
d98a9d3
fixed: circular import errors
chakravarthik27 Sep 18, 2024
7a58067
Refactor test type in safety.py and add decimal formatting in output.py
chakravarthik27 Sep 18, 2024
5e482e1
Refactor multi-label handling in TestResultManager
chakravarthik27 Sep 18, 2024
e9c54e9
fixed: formatted issue
chakravarthik27 Sep 18, 2024
4664bbf
Merge pull request #1118 from JohnSnowLabs/fix/error-in-accuracy-test…
chakravarthik27 Sep 18, 2024
a90c932
Refactor PromptGuard class and related modules
chakravarthik27 Sep 19, 2024
092b3e9
Refactor fairness test to handle multi-label classification in text c…
chakravarthik27 Sep 19, 2024
f362a62
fixed: format and linting issues.
chakravarthik27 Sep 19, 2024
7e2b232
Merge pull request #1121 from JohnSnowLabs/fix/error-in-fairness-test…
chakravarthik27 Sep 19, 2024
d89477a
Merge pull request #1119 from JohnSnowLabs/feature/enhance-security-t…
chakravarthik27 Sep 19, 2024
da9f58b
Refactor security.py: Remove unused classes and methods
chakravarthik27 Sep 19, 2024
90e902f
update version to 2.4.0 in pyproject.toml for release
chakravarthik27 Sep 20, 2024
551cc12
jailbreak and injection tests supports for text-classification.
chakravarthik27 Sep 20, 2024
1b9c7db
Merge pull request #1122 from JohnSnowLabs/release/2.4.0
chakravarthik27 Sep 22, 2024
a2580ba
Add new tests for text classification and prompt evaluation tutorials
chakravarthik27 Sep 23, 2024
833bbaf
Merge pull request #1123 from JohnSnowLabs/chore/final_website_updates
chakravarthik27 Sep 23, 2024
1 change: 1 addition & 0 deletions .github/workflows/build_and_test.yml
@@ -8,6 +8,7 @@ on:
pull_request:
branches:
- "release/*"
- "patch/*"
- "main"

jobs:
1 change: 1 addition & 0 deletions demo/tutorials/llm_notebooks/Visual_QA.ipynb

Large diffs are not rendered by default.

721 changes: 721 additions & 0 deletions demo/tutorials/misc/Add_New_Lines_and_Tabs_Tests.ipynb

Large diffs are not rendered by default.

517 changes: 517 additions & 0 deletions demo/tutorials/misc/Safety_Tests_With_PromptGuard.ipynb

Large diffs are not rendered by default.

@@ -43,4 +43,7 @@ The following table gives an overview of the different tutorial notebooks. In th
| **Multi-Dataset Prompt Configs**: In this notebook, we discuss optimized prompt handling for multiple datasets, allowing users to add custom prompts for each dataset, enabling seamless integration and efficient testing. | OpenAI |Question-Answering | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/MultiPrompt_MultiDataset.ipynb) |
| **Multi-Model, Multi-Dataset**: In this notebook, we discuss testing multiple models on multiple datasets, allowing for comprehensive comparisons and performance assessments in a streamlined manner. | OpenAI |Question-Answering | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Multi_Model_Multi_Dataset.ipynb) |
| **Evaluation_with_Prometheus_Eval**: In this notebook, we discuss how integrating the Prometheus model into langtest brings enhanced evaluation capabilities, providing more detailed and insightful metrics for model performance assessment. | OpenAI |Question-Answering | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Evaluation_with_Prometheus_Eval.ipynb) |
| **Misuse_Test_with_Prometheus_evaluation**: In this notebook, we discuss new safety testing features to identify and mitigate potential misuse and safety issues in your models. | OpenAI |Question-Answering | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Misuse_Test_with_Prometheus_evaluation.ipynb) |
| **Visual_QA**: In this notebook, we discuss visual question answering tests that evaluate how models handle both visual and textual inputs, offering a deeper understanding of their versatility. | OpenAI | Visual-Question-Answering (visualqa) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/llm_notebooks/Visual_QA.ipynb) |
| **Add_New_Lines_and_Tabs_Tests**: In this notebook, we discuss new tests that insert newline and tab characters into text inputs, challenging models to handle structural changes without compromising accuracy. | Hugging Face/John Snow Labs/Spacy |Text-Classification/Question-Answering/Summarization | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Add_New_Lines_and_Tabs_Tests.ipynb) |
| **Safety_Tests_With_PromptGuard**: In this notebook, we discuss evaluating prompts with PromptGuard before they are sent to large language models (LLMs), ensuring harmful or unethical outputs are avoided. | Hugging Face/John Snow Labs/Spacy | Text-Classification/Question-Answering/Summarization | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Safety_Tests_With_PromptGuard.ipynb) |
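The `augmenter.py` diff below reads its settings from a nested config. The shape of that config can be sketched as a plain dict — the key names (`parameters`, `tests`, `defaults`, `max_proportion`, `max_limit`, `style`, `type`) come from the diff itself, while the concrete values here are illustrative:

```python
# Minimal sketch of the config structure DataAugmenter consumes.
# Key names come from the diff; values are illustrative.
config = {
    "parameters": {
        "type": "proportion",  # default in the diff
        "style": "extend",     # extend/add, inplace, or new/transformed
    },
    "tests": {
        "defaults": {"max_proportion": 0.6},
        "robustness": {
            "add_new_lines": {"max_proportion": 0.3},
            "add_tabs": {"max_proportion": 0.2},
        },
    },
}

# Mirror the reads performed in DataAugmenter.__init__:
tests = config.get("tests", {})
max_data_limit = tests.get("parameters", {}).get("max_limit", 0.5)
style = config.get("parameters", {}).get("style", "extend")
print(style, max_data_limit)  # extend 0.5
```

Note that `__init__` looks up `max_limit` under the `tests` section (not `parameters`), so with a config shaped as above the 0.5 default always applies — that is the diff's behavior, reproduced here as-is.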
247 changes: 186 additions & 61 deletions langtest/augmentation/augmenter.py
@@ -1,14 +1,22 @@
from collections import defaultdict
import random
import yaml
import pandas as pd

from typing import Any, Dict, Iterable, Union
from typing import Any, Dict, Iterable, List, Union
from langtest.datahandler.datasource import DataFactory
from langtest.transform import TestFactory
from langtest.tasks.task import TaskManager
from langtest.utils.custom_types.sample import Sample
from langtest.logger import logger


class DataAugmenter:
def __init__(self, task: Union[str, TaskManager], config: Union[str, dict]) -> None:
def __init__(
self,
task: Union[str, TaskManager],
config: Union[str, dict],
) -> None:
"""
Initialize the DataAugmenter.

@@ -23,7 +31,7 @@ def __init__(self, task: Union[str, TaskManager], config: Union[str, dict]) -> N
if isinstance(config, str):
self.__config = self.load_config(config)

self.__tests: dict = self.__config.get("tests", [])
self.__tests: Dict[str, Dict[str, dict]] = self.__config.get("tests", [])
if isinstance(task, str):
if task in ["ner", "text-classification", "question-answering"]:
task = TaskManager(task)
@@ -40,14 +48,12 @@ def __init__(self, task: Union[str, TaskManager], config: Union[str, dict]) -> N
self.__testfactory.is_augment = True

# parameters
self.__max_proportion = self.__tests.get("defaults", {}).get(
"max_proportion", 0.6
)
self.__max_data_limit = self.__tests.get("parameters", {}).get("max_limit", 0.5)
# self.__ntests = len(v for k, v in self.__tests.items()) - 1
self.__type = self.__config.get("parameters", {}).get("type", "proportion")
self.__style = self.__config.get("parameters", {}).get("style", "extend")

self.__df_config = self.__config_df()
self.__df_config = self.__initialize_config_df()

def load_config(self, config: str) -> dict:
"""
@@ -61,93 +67,199 @@ def augment(self, data: Union[str, Iterable]) -> str:
Augment the content.
"""
# load the data
if isinstance(data, dict):
if isinstance(data, dict) and not isinstance(self.__datafactory, DataFactory):
self.__datafactory = self.__datafactory(file_path=data, task=self.__task)

data = self.__datafactory.load()
elif isinstance(self.__datafactory, DataFactory):
data = self.__datafactory.load()

# generate the augmented data
test_cases = self.__testfactory.transform(self.__task, data, self.__tests)

# check the style of augmentation to be applied. Default is extend
if self.__style == "extend":
self.extend(data)
if self.__style == "extend" or self.__style == "add":
self.extend(data, test_cases)
elif self.__style == "inplace":
self.inplace(data)
elif self.__style == "new":
self.new_data(data)
self.inplace(data, test_cases)
elif self.__style == "new" or self.__style == "transformed":
self.new_data(data, test_cases)
else:
raise ValueError("Invalid style")

return self

def extend(self, data: Iterable) -> "DataAugmenter":
def extend(self, data: Iterable, testcases: Iterable[Sample]) -> "DataAugmenter":
"""
Extend the content.
"""
# calculate the number of rows to be added
n = len(data)

data_cut = random.sample(data, int(n * self.__max_proportion))

test_cases: list = self.__testfactory.transform(
self.__task, data_cut, self.__tests
)

self.__augmented_data = [*data, *test_cases] if isinstance(data, list) else data
# arrange the test cases based on the test_type in a dictionary
test_cases = defaultdict(list)
for sample in testcases:
if sample.test_type in test_cases:
test_cases[sample.test_type].append(sample)
else:
test_cases[sample.test_type] = [sample]

final_data = []
# pick the test cases based on the allocated size of the test_type
for _, tests in self.__tests.items():
for test_name, _ in tests.items():
size = self.allocated_size(test_name)

if size == 0:
continue

temp_test_cases = test_cases.get(test_name, [])
if temp_test_cases:
# select random rows based on the size
temp_test_cases = (
random.choices(temp_test_cases, k=size)
if size < len(temp_test_cases)
else temp_test_cases
)
final_data.extend(temp_test_cases)

# append the augmented data to the original data
self.__augmented_data = [*data, *final_data] if isinstance(data, list) else data

return self
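The selection logic in `extend` (bucket the transformed samples by `test_type`, then cap each bucket at its per-test budget) can be sketched standalone. Here `(test_type, text)` tuples stand in for `Sample` objects, and the `allocated` dict stands in for `allocated_size`, which the real class derives from the config:

```python
import random
from collections import defaultdict

# Stand-ins for transformed Sample objects: (test_type, text) pairs.
testcases = [("add_tabs", f"t{i}") for i in range(5)] + [
    ("uppercase", f"u{i}") for i in range(3)
]
# Hypothetical per-test budgets, standing in for allocated_size(test_name).
allocated = {"add_tabs": 2, "uppercase": 10}

# Bucket the samples by test_type, as extend() does.
buckets = defaultdict(list)
for test_type, text in testcases:
    buckets[test_type].append((test_type, text))

final_data = []
for test_name, size in allocated.items():
    cases = buckets.get(test_name, [])
    if cases:
        # cap each bucket at its allocated size
        cases = random.choices(cases, k=size) if size < len(cases) else cases
        final_data.extend(cases)

print(len(final_data))  # 2 sampled add_tabs cases + all 3 uppercase cases -> 5
```

As in the diff, a bucket smaller than its budget is taken whole, while a larger one is sampled down with `random.choices` (which samples with replacement, so duplicates are possible).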

def inplace(self, data: Iterable) -> "DataAugmenter":
def inplace(self, data: Iterable, testcases: Iterable[Sample]) -> "DataAugmenter":
"""
Inplace augmentation.
"""
# calculate the number of rows to be added
size = int(len(data) * self.__max_proportion)

# create a dictionary with index as key and data as value
# indices of the data and the data itself
data_indices = self.prepare_hash_map(data, inverted=True)
data_dict = self.prepare_hash_map(data)

# select random rows based on the size with its index
selected = random.sample(data_dict.keys(), int(size))

for idx in selected:
test_cases = self.__testfactory.transform(
self.__task, [data_dict[idx]], self.__tests
# arrange the test cases based on the test type in a dictionary
test_cases = defaultdict(list)
for sample in testcases:
if sample.test_type in test_cases:
test_cases[sample.test_type].append(sample)
else:
test_cases[sample.test_type] = [sample]

# pick the test cases based on the allocated size of the test_type
final_data: List[Sample] = []
for _, tests in self.__tests.items():
for test_name, _ in tests.items():
size = self.allocated_size(test_name)

if size == 0:
continue

temp_test_cases = test_cases.get(test_name, [])
if temp_test_cases:
# select random rows based on the size
temp_test_cases = (
random.choices(temp_test_cases, k=size)
if size < len(temp_test_cases)
else temp_test_cases
)
final_data.extend(temp_test_cases)

# replace the original data with the augmented data at its exact position.
for sample in final_data:
key = (
sample.original_question
if hasattr(sample, "original_question")
else sample.original
)
data_dict[idx] = test_cases[0] if test_cases else data_dict[idx]
index = data_indices[key]
data_dict[index] = sample

self.__augmented_data = data_dict.values()

return self

def new_data(self, data: Iterable) -> "DataAugmenter":
def new_data(self, data: Iterable, testcases: Iterable[Sample]) -> "DataAugmenter":
"""
Create new data.
"""
# calculate the number of rows to be added
size = int(len(data) * self.__max_proportion)
# arrange the test cases based on the test type in a dictionary
test_cases = defaultdict(list)
for sample in testcases:
if sample.test_type in test_cases:
test_cases[sample.test_type].append(sample)
else:
test_cases[sample.test_type] = [sample]

final_data = []

# pick the test cases based on the allocated size of the test_type
for _, tests in self.__tests.items():
for test_name, _ in tests.items():
size = self.allocated_size(test_name)

data_cut = random.sample(data, size)
if size == 0:
continue

test_cases = self.__testfactory.transform(self.__task, data_cut, self.__tests)
temp_test_cases = test_cases.get(test_name, [])
if temp_test_cases:
# select random rows based on the size
temp_test_cases = (
random.choices(temp_test_cases, k=size)
if size < len(temp_test_cases)
else temp_test_cases
)
final_data.extend(temp_test_cases)

self.__augmented_data = test_cases
# replace the original data with the augmented data
self.__augmented_data = final_data

return self

def size(self, category: str, test_name: str) -> int:
return (
self.__max_proportion
* self.__tests.get(category, {}).get(test_name, {}).get("max_proportion", 0.6)
) / self.__df_config.shape[0]
def allocated_size(self, test_name: str) -> int:
"""Allocated number of augmented samples for the given test."""

def prepare_hash_map(self, data: Union[str, Iterable]) -> Dict[str, Any]:
hashmap = {index: sample for index, sample in enumerate(data)}
try:
max_data_limit = (
len(self.__datafactory)
* self.__max_data_limit
* self.__df_config.loc[test_name, "avg_proportion"]
)

return int(
max_data_limit * self.__df_config.loc[test_name, "normalized_proportion"]
)
except AttributeError:
raise ValueError(
"Dataset is not loaded. Please load the data using the `DataAugmenter.augment(data={'data_source': '..'})` method"
)
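The budget computed by `allocated_size` reduces to a product of four factors, two of which come from the config table built in `__initialize_config_df`. A standalone sketch with illustrative values:

```python
# allocated = dataset_size * max_data_limit * avg_proportion * normalized_proportion
# (values illustrative; the column math follows __initialize_config_df)
dataset_size = 100      # len(self.__datafactory)
max_data_limit = 0.5    # config default
proportions = {"add_tabs": 0.2, "uppercase": 0.6}

total = sum(proportions.values())
avg = round(total / len(proportions), 2)                          # avg_proportion
normalized = {k: round(v / total, 2) for k, v in proportions.items()}

allocated = {
    k: int(dataset_size * max_data_limit * avg * normalized[k]) for k in proportions
}
print(allocated)  # {'add_tabs': 5, 'uppercase': 15}
```

Because the final value passes through `int()`, small budgets truncate toward zero — a test whose share works out below 1 sample is silently skipped by the `size == 0` checks in the calling methods.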

def prepare_hash_map(
self, data: Union[Iterable[Sample], Sample], inverted=False
) -> Dict[str, Any]:
if inverted:
hashmap = {}
for index, sample in enumerate(data):
key = (
sample.original_question
if hasattr(sample, "original_question")
else sample.original
)
hashmap[key] = index
else:
hashmap = {index: sample for index, sample in enumerate(data)}

return hashmap
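`prepare_hash_map` with `inverted=True` builds a text-to-index lookup so that `inplace` can put each augmented sample back at its source row. A sketch with plain strings standing in for `Sample` objects:

```python
data = ["the cat sat", "dogs bark", "birds fly"]  # stand-ins for Sample objects

# inverted=True: original text -> row index (lets inplace() find the source row)
inverted = {sample: idx for idx, sample in enumerate(data)}
# inverted=False: row index -> sample (the mutable copy that gets patched)
by_index = dict(enumerate(data))

# put an augmented sample back at the row its original text came from
original, perturbed = "dogs bark", "dogs\tbark"  # illustrative perturbation
by_index[inverted[original]] = perturbed
print(list(by_index.values()))  # ['the cat sat', 'dogs\tbark', 'birds fly']
```

This keying assumes original texts are unique within the dataset; duplicate texts would collapse to a single entry in the inverted map.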

def save(self, file_path: str):
def save(self, file_path: str, for_gen_ai=False) -> None:
"""
Save the augmented data.
"""
self.__datafactory.export(data=self.__augmented_data, output_path=file_path)
try:
            # a .json output path is only valid when for_gen_ai is True and
            # the task is ner; otherwise reject .json file paths
if not (for_gen_ai) and self.__task.task_name == "ner":
if file_path.endswith(".json"):
raise ValueError("File path shouldn't be .json file")

self.__datafactory.export(data=self.__augmented_data, output_path=file_path)
except Exception as e:
logger.error(f"Error in saving the augmented data: {e}")

def __or__(self, other: Iterable):
results = self.augment(other)
@@ -157,28 +269,41 @@ def __ror__(self, other: Iterable):
results = self.augment(other)
return results

def __config_df(self):
def __initialize_config_df(self) -> pd.DataFrame:
"""
Configure the data frame.
"""

import pandas as pd

df = pd.DataFrame(columns=["category", "test_name", "proportion"])

# read the configuration
temp_data = []
for category, tests in self.__tests.items():
if category not in ["robustness", "bias"]:
continue
for test_name, test in tests.items():
proportion = test.get("max_proportion", 0.6)
temp = pd.DataFrame(
proportion = test.get("max_proportion", 0.2)
temp_data.append(
{
"category": [category],
"test_name": [test_name],
"proportion": [proportion],
},
"category": category,
"test_name": test_name,
"proportion": proportion,
}
)
df = pd.concat([df, temp], ignore_index=True)
df = pd.concat([df, pd.DataFrame(temp_data)], ignore_index=True)

# Convert 'proportion' column to float
df["proportion"] = pd.to_numeric(df["proportion"], errors="coerce")

# normalize the proportion and round it to 2 decimal places
df["normalized_proportion"] = df["proportion"] / df["proportion"].sum()
df["normalized_proportion"] = df["normalized_proportion"].apply(
lambda x: round(x, 2)
)

df["avg_proportion"] = df["proportion"].mean(numeric_only=True).round(2)

# set the index as test_name
df.set_index("test_name", inplace=True)

return df
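The config table built above can be reproduced in a few lines; the column math (coerce to numeric, normalize, round to two places, shared average) follows the diff, with illustrative test names and proportions:

```python
import pandas as pd

rows = [
    {"category": "robustness", "test_name": "add_tabs", "proportion": 0.2},
    {"category": "robustness", "test_name": "uppercase", "proportion": 0.6},
]
df = pd.DataFrame(rows)
df["proportion"] = pd.to_numeric(df["proportion"], errors="coerce")
df["normalized_proportion"] = (df["proportion"] / df["proportion"].sum()).round(2)
df["avg_proportion"] = round(float(df["proportion"].mean()), 2)
df = df.set_index("test_name")
print(df.loc["add_tabs", "normalized_proportion"])  # 0.25
```

Indexing by `test_name` is what lets `allocated_size` do `df.loc[test_name, ...]` lookups; building all rows first and concatenating once (rather than `pd.concat` inside the loop) is the performance fix this diff makes.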