diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index d4385934..e98ab3cb 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -23,6 +23,7 @@ jobs:
         pip install -r requirements.txt
         pip install -r requirements-test.txt
         pip install -r requirements-evaluation.txt
+        pip install -r requirements-roberta.txt
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names
diff --git a/requirements-roberta.txt b/requirements-roberta.txt
new file mode 100644
index 00000000..287d83b5
--- /dev/null
+++ b/requirements-roberta.txt
@@ -0,0 +1,6 @@
+tqdm==4.49.0
+scikit-learn~=0.24.2
+transformers==4.6.1
+tokenizers==0.10.2
+torch==1.8.1
+wandb==0.10.31
\ No newline at end of file
diff --git a/src/python/evaluation/qodana/imitation_model/README.md b/src/python/evaluation/qodana/imitation_model/README.md
new file mode 100644
index 00000000..afb43ad3
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/README.md
@@ -0,0 +1,118 @@
+# Qodana imitation model
+## Description
+The general purpose of the model is to simulate the behavior of [`Qodana`](https://github.com/JetBrains/Qodana/tree/main) –
+a code quality monitoring tool that identifies and suggests fixes for bugs, security vulnerabilities, duplications, and imperfections.
+
+Motivation for developing the model:
+- acceleration of the code analysis process by training the model to recognize a certain class of errors;
+- the ability to run the model on separate files without the need to create a project (for example, for the Java language).
+
+
+## Architecture
+A [`RobertaForSequenceClassification`](https://huggingface.co/transformers/model_doc/roberta.html#robertaforsequenceclassification) model with [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) solves the multilabel classification task.
+
+The model output is a tensor of size `batch_size` x `num_classes`, where `batch_size` is the number of training examples utilized in one iteration,
+and `num_classes` is the number of error types encountered in the dataset. By a model class here, we mean a unique error type.
+Class probabilities are obtained by applying `sigmoid`, and final predictions are computed by comparing the probability of each class with the `threshold`.
+
+As classes might be imbalanced, the metric used is `f1-score`.
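+
+For illustration, a minimal sketch of how final predictions are derived from the model output (toy logits, the default `threshold`):
+
+```python
+import torch
+
+# Toy logits for a batch of 2 samples and 3 inspection classes.
+logits = torch.tensor([[2.0, -1.5, 0.3],
+                       [-0.2, 1.1, -3.0]])
+
+threshold = 0.5
+probabilities = logits.sigmoid()                 # shape: batch_size x num_classes
+predictions = (probabilities > threshold).int()
+print(predictions)  # tensor([[1, 0, 1], [0, 1, 0]], dtype=torch.int32)
+```
+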
+## What it does
+
+The model has two use cases:
+- It can be trained to predict the set of unique errors in a **block** of code of arbitrary length.
+
+**Example**:
+
+code | inspections
+--- | ---
+|`import java.util.Scanner; class Main {public static void main(String[] args) {Scanner scanner = new Scanner(System.in);// put your code here int num = scanner.nextInt(); System.out.println((num / 10 ) % 10);}}`| 1, 2|
+
+
+- It can be trained to predict the set of unique errors in a **line** of code.
+
+**Example**
+
+code | inspections
+--- | ---
+|`import java.util.Scanner;`| 0|
+|`\n`|0|
+|`class Main {`|1|
+|`public static void main(String[] args) {`|1|
+|`Scanner scanner = new Scanner(System.in);`|0|
+|`// put your code here`|0|
+|`int num = scanner.nextInt();`|0|
+|`System.out.println((num / 10 ) % 10);`|2|
+|`}`|0|
+|`}`|0|
+
+
+## Data preprocessing
+
+Please refer to the [`labeling documentation`](src/python/evaluation/qodana) to label the dataset and to the [`preprocessing documentation`](preprocessing) to preprocess the data for model training and evaluation.
+
+After completing the 3rd preprocessing step you should have 3 folders:
+`train`, `val`, `test` with `train.csv`, `val.csv` and `test.csv` respectively.
+
+Each file has the same structure and should consist of 4+ columns:
+- `id` – solution id;
+- `code` – line of code or block of code;
+- `lang` – language version;
+- `0`, `1`, `2` ... `n` – several columns, equal to the number of unique errors detected by Qodana in the dataset.
+The values in the columns are binary: `1` if the inspection is detected and `0` otherwise.
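+
+For illustration, a miniature of such a file with hypothetical values:
+
+```python
+import pandas as pd
+
+# A toy stand-in for train/train.csv.
+df = pd.DataFrame({
+    'id': [17, 17],
+    'code': ['import java.util.Scanner;', 'class Main {'],
+    'lang': ['java11', 'java11'],
+    '0': [0, 1],  # one binary column per unique inspection
+    '1': [0, 0],
+})
+
+# Every inspection column holds a binary indicator.
+assert df.iloc[:, 3:].isin([0, 1]).all().all()
+```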
+
+
+## How to train the model
+
+Run the [`train.py`](train.py) script from the command line with the following arguments:
+
+Required arguments:
+
+- `train_dataset_path` ‑ path to `train.csv` – the file with the samples
+that the model will use for training.
+
+- `val_dataset_path` ‑ path to `val.csv` – the file with the samples
+that the model will use for evaluation during training.
+
+Both files are produced by the [`split_dataset.py`](preprocessing/split_dataset.py) script and have the structure described above.
+
+Optional arguments:
+
+Argument | Description
+--- | ---
+|**‑wp**, **‑‑trained_weights_directory_path**| Path to the directory where model weights will be saved. If not set, a folder will be created in the `train` folder where the `train.csv` dataset is stored.|
+|**‑cl**, **‑‑context_length**| Sequence length of a sample after tokenization. Available values are any `positive integers`. **Default is 40**.|
+|**‑e**, **‑‑epoch**| Number of epochs to train the model. **Default is 1**.|
+|**‑bs**, **‑‑batch_size**| Batch size for the training and validation datasets. Available values are any `positive integers`. **Default is 16**.|
+|**‑lr**, **‑‑learning_rate**| Optimizer learning rate. **Default is 2e-5**.|
+|**‑wd**, **‑‑weight_decay**| Weight decay parameter for the optimizer. **Default is 0.01**.|
+|**‑th**, **‑‑threshold**| Used to compute predictions. Available values: 0 < `threshold` < 1. If the probability of a class is greater than `threshold`, the sample will be classified with the corresponding inspection. **Default is 0.5**.|
+|**‑ws**, **‑‑warm_up_steps**| The number of steps during which the optimizer uses a constant learning rate before the scheduler policy is applied. **Default is 300**.|
+|**‑sl**, **‑‑save_limit**| Maximum number of checkpoints to keep. **Default is 1**.|
+
+To inspect the rest of the default training parameters, please refer to [`TrainingArguments`](common/train_config.py).
+
+## How to evaluate the model
+
+Run the [`evaluation.py`](evaluation.py) script from the command line with the following arguments:
+
+Required arguments:
+
+`test_dataset_path` ‑ path to `test.csv` received by running the [`split_dataset.py`](preprocessing/split_dataset.py) script.
+
+`model_weights_directory_path` ‑ path to the folder where the trained model weights are saved.
+
+Optional arguments:
+
+Argument | Description
+--- | ---
+|**‑o**, **‑‑output_directory_path**| Path to the directory where the labeled dataset will be saved. Default is the `test` folder.|
+|**‑cl**, **‑‑context_length**| Sequence length of a sample after tokenization. Available values are any `positive integers`. **Default is 40**.|
+|**‑sf**, **‑‑save_f1_score**| If enabled, a report with f1 scores by class will be saved to a `csv` file in the parent directory of the labeled dataset. **Disabled by default**.|
+|**‑bs**, **‑‑batch_size**| The number of examples utilized in one evaluation iteration. Available values are any `positive integers`. **Default is 8**.|
+|**‑th**, **‑‑threshold**| Used to compute predictions. Available values: 0 < `threshold` < 1. If the probability of a class is greater than `threshold`, the sample will be classified with the corresponding inspection. **Default is 0.5**.|
+
+The output is a `predictions.csv` file whose column names match the class indices. Each sample has a binary label:
+
+- `0` ‑ if the model didn't find the error in a sample.
+
+- `1` ‑ if the error was found in a sample.
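+
+For example, the saved predictions can be inspected with `pandas` (hypothetical path):
+
+```python
+import pandas as pd
+
+predictions = pd.read_csv('test/predictions.csv')
+
+# Each column is a class index, so a column sum is the number
+# of samples flagged with that inspection.
+print(predictions.sum(axis=0).sort_values(ascending=False).head())
+```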
diff --git a/src/python/evaluation/qodana/imitation_model/__init__.py b/src/python/evaluation/qodana/imitation_model/__init__.py
new file mode 100644
index 00000000..0c4d7f8e
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/__init__.py
@@ -0,0 +1,3 @@
+from src.python import MAIN_FOLDER
+
+MODEL_FOLDER = MAIN_FOLDER.parent / 'python/imitation_model'
diff --git a/src/python/evaluation/qodana/imitation_model/common/__init__.py b/src/python/evaluation/qodana/imitation_model/common/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/python/evaluation/qodana/imitation_model/common/evaluation_config.py b/src/python/evaluation/qodana/imitation_model/common/evaluation_config.py
new file mode 100644
index 00000000..e91bf687
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/common/evaluation_config.py
@@ -0,0 +1,47 @@
+import argparse
+
+from src.python.evaluation.qodana.imitation_model.common.util import ModelCommonArgument
+from src.python.review.common.file_system import Extension
+
+
+def configure_arguments(parser: argparse.ArgumentParser) -> None:
+    parser.add_argument('test_dataset_path',
+                        type=str,
+                        help='Path to the dataset received by either the'
+                             f' src.python.evaluation.qodana.fragment_to_inspections_list{Extension.PY.value}'
+                             ' or the src.python.evaluation.qodana.fragment_to_inspections_list_line_by_line'
+                             f'{Extension.PY.value} script.')
+
+    parser.add_argument('model_weights_directory_path',
+                        type=str,
+                        help='Path to the directory where trained imitation_model weights are stored.')
+
+    parser.add_argument('-o', '--output_directory_path',
+                        default=None,
+                        type=str,
+                        help='Path to the directory where the labeled dataset will be saved. '
+                             'Default is the parent folder of test_dataset_path.')
+
+    parser.add_argument('-sf', '--save_f1_score',
+                        default=None,
+                        action="store_true",
+                        help=f'If enabled, a report with f1 scores by class will be saved to a {Extension.CSV.value}'
+                             ' file in the labeled dataset parent directory. Default is False.')
+
+    parser.add_argument(ModelCommonArgument.CONTEXT_LENGTH.value.short_name,
+                        ModelCommonArgument.CONTEXT_LENGTH.value.long_name,
+                        type=int,
+                        default=40,
+                        help=ModelCommonArgument.CONTEXT_LENGTH.value.description)
+
+    parser.add_argument(ModelCommonArgument.BATCH_SIZE.value.short_name,
+                        ModelCommonArgument.BATCH_SIZE.value.long_name,
+                        type=int,
+                        default=8,
+                        help=ModelCommonArgument.BATCH_SIZE.value.description)
+
+    parser.add_argument(ModelCommonArgument.THRESHOLD.value.short_name,
+                        ModelCommonArgument.THRESHOLD.value.long_name,
+                        type=float,
+                        default=0.5,
+                        help=ModelCommonArgument.THRESHOLD.value.description)
diff --git a/src/python/evaluation/qodana/imitation_model/common/metric.py b/src/python/evaluation/qodana/imitation_model/common/metric.py
new file mode 100644
index 00000000..dce80a94
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/common/metric.py
@@ -0,0 +1,41 @@
+import logging.config
+from typing import Optional
+
+import torch
+from sklearn.metrics import multilabel_confusion_matrix
+from src.python.evaluation.qodana.imitation_model.common.util import MeasurerArgument
+
+logger = logging.getLogger(__name__)
+
+
+class Measurer:
+    def __init__(self, threshold: float):
+        self.threshold = threshold
+
+    def get_f1_score(self, predictions: torch.tensor, targets: torch.tensor) -> Optional[float]:
+        # each per-class matrix is [[TN, FP], [FN, TP]]
+        confusion_matrix = multilabel_confusion_matrix(targets, predictions)
+        false_positives = sum(score[0][1] for score in confusion_matrix)
+        false_negatives = sum(score[1][0] for score in confusion_matrix)
+        true_positives = sum(score[1][1] for score in confusion_matrix)
+        # numpy integers do not raise ZeroDivisionError, so check the denominator explicitly
+        denominator = true_positives + 1 / 2 * (false_positives + false_negatives)
+        if denominator == 0:
+            logger.error("No values of the class present in the dataset.")
+            # return None to make it clear after printing which classes are missing in the datasets
+            return None
+        return true_positives / denominator
+
+    def compute_metric(self, evaluation_predictions: torch.tensor) -> dict:
+        logits, targets = evaluation_predictions
+        prediction_probabilities = torch.from_numpy(logits).sigmoid()
+        predictions = torch.where(prediction_probabilities > self.threshold, 1, 0)
+        return {MeasurerArgument.F1_SCORE.value: self.get_f1_score(predictions, torch.tensor(targets))}
+
+    def f1_score_by_classes(self, predictions: torch.tensor, targets: torch.tensor) -> dict:
+        unique_classes = range(len(targets[0]))
+        f1_scores_by_classes = {}
+        for unique_class in unique_classes:
+            # evaluate each class only on the samples where it is actually present
+            class_mask = torch.where(targets[:, unique_class] == 1)
+            f1_scores_by_classes[str(unique_class)] = self.get_f1_score(predictions[class_mask[0], unique_class],
+                                                                        targets[class_mask[0], unique_class])
+        return f1_scores_by_classes
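The aggregation in `get_f1_score` (summing TP, FP and FN over all classes before combining them) is exactly micro-averaged F1, so it can be cross-checked against scikit-learn. A minimal sketch, assuming the repository root is on `PYTHONPATH`:

```python
import torch
from sklearn.metrics import f1_score

from src.python.evaluation.qodana.imitation_model.common.metric import Measurer

targets = torch.tensor([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0]])
predictions = torch.tensor([[1, 0, 0],
                            [0, 1, 1],
                            [1, 0, 0]])

measurer = Measurer(threshold=0.5)
# TP=3, FP=1, FN=2 over all classes -> 3 / (3 + 0.5 * 3) = 2/3
assert abs(measurer.get_f1_score(predictions, targets)
           - f1_score(targets, predictions, average='micro')) < 1e-9
```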
diff --git a/src/python/evaluation/qodana/imitation_model/common/train_config.py b/src/python/evaluation/qodana/imitation_model/common/train_config.py
new file mode 100644
index 00000000..ba2d93fa
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/common/train_config.py
@@ -0,0 +1,118 @@
+import argparse
+
+import torch
+from src.python.evaluation.qodana.imitation_model.common.util import (
+    DatasetColumnArgument,
+    MeasurerArgument,
+    ModelCommonArgument,
+    SeedArgument,
+)
+from transformers import Trainer, TrainingArguments
+
+
+class MultilabelTrainer(Trainer):
+    """ By default, RobertaForSequenceClassification does not support
+    multi-label classification: it expects target and logits tensors
+    to be torch.FloatTensor of shape (1,).
+    https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaForSequenceClassification
+
+    To fine-tune the model for the multi-label classification task we can simply modify the trainer by
+    changing its loss function. https://huggingface.co/transformers/main_classes/trainer.html
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def compute_loss(self, model, inputs, return_outputs=False):
+        labels = inputs.pop(DatasetColumnArgument.LABELS.value)
+        outputs = model(**inputs)
+        logits = outputs.logits
+        loss_bce = torch.nn.BCEWithLogitsLoss()
+        loss = loss_bce(logits.view(-1, self.model.config.num_labels),
+                        labels.float().view(-1, self.model.config.num_labels))
+
+        return (loss, outputs) if return_outputs else loss
+
+
+def configure_arguments(parser: argparse.ArgumentParser) -> None:
+    parser.add_argument('train_dataset_path',
+                        type=str,
+                        help='Path to the train dataset.')
+
+    parser.add_argument('val_dataset_path',
+                        type=str,
+                        help='Path to the validation dataset.')
+
+    parser.add_argument('-wp', '--trained_weights_directory_path',
+                        default=None,
+                        type=str,
+                        help='Path to the directory where to save imitation_model weights. Default is the directory '
+                             'where the train dataset is stored.')
+
+    parser.add_argument(ModelCommonArgument.CONTEXT_LENGTH.value.short_name,
+                        ModelCommonArgument.CONTEXT_LENGTH.value.long_name,
+                        type=int,
+                        default=40,
+                        help=ModelCommonArgument.CONTEXT_LENGTH.value.description)
+
+    parser.add_argument(ModelCommonArgument.BATCH_SIZE.value.short_name,
+                        ModelCommonArgument.BATCH_SIZE.value.long_name,
+                        type=int,
+                        default=16,
+                        help=ModelCommonArgument.BATCH_SIZE.value.description)
+
+    parser.add_argument(ModelCommonArgument.THRESHOLD.value.short_name,
+                        ModelCommonArgument.THRESHOLD.value.long_name,
+                        type=float,
+                        default=0.5,
+                        help=ModelCommonArgument.THRESHOLD.value.description)
+
+    parser.add_argument('-lr', '--learning_rate',
+                        type=float,
+                        default=2e-5,
+                        help='Learning rate.')
+
+    parser.add_argument('-wd', '--weight_decay',
+                        type=float,
+                        default=0.01,
+                        help='Weight decay parameter for the optimizer.')
+
+    parser.add_argument('-e', '--epoch',
+                        type=int,
+                        default=1,
+                        help='Number of epochs to train the imitation_model.')
+
+    parser.add_argument('-ws', '--warm_up_steps',
+                        type=int,
+                        default=300,
+                        help='Number of steps used for a linear warmup. Default is 300.')
+
+    parser.add_argument('-sl', '--save_limit',
+                        type=int,
+                        default=1,
+                        help='Maximum number of checkpoints to keep. Default is 1.')
+
+
+class TrainingArgs:
+    def __init__(self, args):
+        self.args = args
+
+    def get_training_args(self, val_steps_to_be_made):
+        return TrainingArguments(num_train_epochs=self.args.epoch,
+                                 per_device_train_batch_size=self.args.batch_size,
+                                 per_device_eval_batch_size=self.args.batch_size,
+                                 learning_rate=self.args.learning_rate,
+                                 warmup_steps=self.args.warm_up_steps,
+                                 weight_decay=self.args.weight_decay,
+                                 save_total_limit=self.args.save_limit,
+                                 output_dir=self.args.trained_weights_directory_path,
+                                 overwrite_output_dir=True,
+                                 load_best_model_at_end=True,
+                                 # pick the best checkpoint by the f1 score from Measurer.compute_metric
+                                 metric_for_best_model=MeasurerArgument.F1_SCORE.value,
+                                 greater_is_better=True,
+                                 save_steps=val_steps_to_be_made,
+                                 eval_steps=val_steps_to_be_made,
+                                 logging_steps=val_steps_to_be_made,
+                                 evaluation_strategy=DatasetColumnArgument.STEPS.value,
+                                 logging_strategy=DatasetColumnArgument.STEPS.value,
+                                 seed=SeedArgument.SEED.value,
+                                 report_to=[DatasetColumnArgument.WANDB.value])
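Why the loss is swapped: `BCEWithLogitsLoss` scores every logit independently against its own binary label, so several classes can be active for one sample. A minimal, self-contained illustration with toy values:

```python
import torch

loss_fn = torch.nn.BCEWithLogitsLoss()

# Two samples, three classes; several labels can be active at once.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [-0.5, 1.5, -2.0]])
labels = torch.tensor([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])

# Each logit gets its own sigmoid + binary cross-entropy term,
# independently of the other classes.
print(loss_fn(logits, labels))  # tensor(0.2861), the mean over all six logits
```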
diff --git a/src/python/evaluation/qodana/imitation_model/common/util.py b/src/python/evaluation/qodana/imitation_model/common/util.py
new file mode 100644
index 00000000..da0d29e5
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/common/util.py
@@ -0,0 +1,46 @@
+from enum import Enum, unique
+
+from src.python.common.tool_arguments import ArgumentsInfo
+
+
+@unique
+class DatasetColumnArgument(Enum):
+    ID = 'id'
+    IN_ID = 'inspection_id'
+    INSPECTIONS = 'inspections'
+    INPUT_IDS = 'input_ids'
+    LABELS = 'labels'
+    DATASET_PATH = 'dataset_path'
+    STEPS = 'steps'
+    WEIGHTS = 'weights'
+    WANDB = 'wandb'
+
+
+@unique
+class SeedArgument(Enum):
+    SEED = 42
+
+
+@unique
+class CustomTokens(Enum):
+    NOC = '[NOC]'  # "no context" token to add when there are no lines for the context
+
+
+@unique
+class ModelCommonArgument(Enum):
+    THRESHOLD = ArgumentsInfo('-th', '--threshold',
+                              'If the probability of an inspection on a code sample is greater than the threshold, '
+                              'the inspection id will be assigned to the sample. '
+                              'Default is 0.5.')
+
+    CONTEXT_LENGTH = ArgumentsInfo('-cl', '--context_length',
+                                   'Sequence length of 1 sample after tokenization, default is 40.')
+
+    BATCH_SIZE = ArgumentsInfo('-bs', '--batch_size',
+                               'Batch size – default values are 16 for training and 8 for evaluation mode.')
+
+
+@unique
+class MeasurerArgument(Enum):
+    F1_SCORE = 'f1_score'
+    F1_SCORES_BY_CLS = 'f1_scores_by_class'
diff --git a/src/python/evaluation/qodana/imitation_model/dataset/__init__.py b/src/python/evaluation/qodana/imitation_model/dataset/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/python/evaluation/qodana/imitation_model/dataset/dataset.py b/src/python/evaluation/qodana/imitation_model/dataset/dataset.py
new file mode 100644
index 00000000..088ce548
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/dataset/dataset.py
@@ -0,0 +1,34 @@
+import logging
+
+import pandas as pd
+import torch
+from src.python.evaluation.common.util import ColumnName
+from src.python.evaluation.qodana.imitation_model.common.util import DatasetColumnArgument
+from torch.utils.data import Dataset
+from transformers import RobertaTokenizer
+
+logger = logging.getLogger(__name__)
+
+
+class QodanaDataset(Dataset):
+    """ DatasetColumnArgument.ID.value is the id of the solution that corresponds to the line,
+    DatasetColumnArgument.INSPECTIONS.value is the target column name in the dataset,
+    ColumnName.CODE.value is the observation column name in the dataset where lines of code are stored.
+    """
+
+    def __init__(self, data_path: str, context_length: int):
+        super().__init__()
+        df = pd.read_csv(data_path)
+        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
+        code = list(map(str, df[ColumnName.CODE.value]))
+        # all columns after the code column hold the binary inspection labels
+        self.target = torch.tensor(df.iloc[:, 1:].astype(float).values)
+        self.code_encoded = tokenizer(
+            code, padding=True, truncation=True, max_length=context_length, return_tensors="pt",
+        )[DatasetColumnArgument.INPUT_IDS.value]
+
+    def __getitem__(self, idx):
+        return {DatasetColumnArgument.INPUT_IDS.value: self.code_encoded[idx],
+                DatasetColumnArgument.LABELS.value: self.target[idx]}
+
+    def __len__(self):
+        return len(self.target)
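A minimal usage sketch for `QodanaDataset` (hypothetical CSV path; assumes the repository root is on `PYTHONPATH` and that the `roberta-base` tokenizer can be downloaded on first use):

```python
from torch.utils.data import DataLoader

from src.python.evaluation.qodana.imitation_model.dataset.dataset import QodanaDataset

dataset = QodanaDataset('train/train.csv', context_length=40)  # hypothetical path

item = dataset[0]
print(item['input_ids'].shape)  # torch.Size([L]): padded to the longest line, truncated to context_length
print(item['labels'].shape)     # torch.Size([num_classes])

loader = DataLoader(dataset, batch_size=16)
batch = next(iter(loader))
print(batch['input_ids'].shape, batch['labels'].shape)  # torch.Size([16, L]) torch.Size([16, num_classes])
```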
diff --git a/src/python/evaluation/qodana/imitation_model/evaluation.py b/src/python/evaluation/qodana/imitation_model/evaluation.py
new file mode 100644
index 00000000..9eac3ec2
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/evaluation.py
@@ -0,0 +1,75 @@
+import argparse
+import sys
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+import torch
+import transformers
+from src.python.evaluation.common.csv_util import write_dataframe_to_csv
+from src.python.evaluation.qodana.imitation_model.common.evaluation_config import configure_arguments
+from src.python.evaluation.qodana.imitation_model.common.metric import Measurer
+from src.python.evaluation.qodana.imitation_model.common.util import DatasetColumnArgument, MeasurerArgument
+from src.python.evaluation.qodana.imitation_model.dataset.dataset import QodanaDataset
+from src.python.review.common.file_system import Extension
+from torch.utils.data import DataLoader
+from transformers import RobertaForSequenceClassification
+
+
+def get_predictions(eval_dataloader: torch.utils.data.DataLoader,
+                    model: transformers.RobertaForSequenceClassification,
+                    predictions: np.ndarray,
+                    num_labels: int,
+                    device: torch.device,
+                    args: argparse.Namespace) -> pd.DataFrame:
+    start_index = 0
+    for batch in eval_dataloader:
+        with torch.no_grad():
+            logits = model(input_ids=batch[DatasetColumnArgument.INPUT_IDS.value].to(device)).logits
+        logits = logits.sigmoid().detach().cpu().numpy()
+        predictions[start_index:start_index + args.batch_size, :num_labels] = (logits > args.threshold).astype(int)
+        start_index += args.batch_size
+    return pd.DataFrame(predictions, columns=range(num_labels), dtype=int)
+
+
+def save_f1_scores(output_directory_path: Path, f1_score_by_class_dict: dict) -> None:
+    f1_score_report_file_name = f'{MeasurerArgument.F1_SCORES_BY_CLS.value}{Extension.CSV.value}'
+    f1_score_report_path = Path(output_directory_path).parent / f1_score_report_file_name
+    f1_score_report_df = pd.DataFrame({MeasurerArgument.F1_SCORE.value: f1_score_by_class_dict.values(),
+                                       'inspection_id': range(len(f1_score_by_class_dict.values()))})
+    write_dataframe_to_csv(f1_score_report_path, f1_score_report_df)
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    configure_arguments(parser)
+    args = parser.parse_args()
+    if args.output_directory_path is None:
+        args.output_directory_path = Path(args.test_dataset_path).parent / f'predictions{Extension.CSV.value}'
+
+    test_dataset = QodanaDataset(args.test_dataset_path, args.context_length)
+    num_labels = test_dataset[0][DatasetColumnArgument.LABELS.value].shape[0]
+    eval_dataloader = DataLoader(test_dataset, batch_size=args.batch_size)
+    predictions = np.zeros([len(test_dataset), num_labels], dtype=object)
+
+    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    model = RobertaForSequenceClassification.from_pretrained(args.model_weights_directory_path,
+                                                             num_labels=num_labels).to(device)
+    model.eval()
+
+    predictions = get_predictions(eval_dataloader, model, predictions, num_labels, device, args)
+    # all columns after the code column are the true labels
+    true_labels = torch.tensor(pd.read_csv(args.test_dataset_path).iloc[:, 1:].to_numpy())
+    metric = Measurer(args.threshold)
+    f1_score_by_class_dict = metric.f1_score_by_classes(torch.tensor(predictions.to_numpy()), true_labels)
+
+    print(f"{MeasurerArgument.F1_SCORE.value}:"
+          f"{metric.get_f1_score(torch.tensor(predictions.to_numpy()), true_labels)}",
+          f"\n{MeasurerArgument.F1_SCORES_BY_CLS.value}: {f1_score_by_class_dict}")
+
+    write_dataframe_to_csv(args.output_directory_path, predictions)
+    if args.save_f1_score:
+        save_f1_scores(args.output_directory_path, f1_score_by_class_dict)
+
+
+if __name__ == '__main__':
+    sys.exit(main())
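`get_predictions` writes each batch into a preallocated array; the slice `start_index:start_index + args.batch_size` also fits the final, smaller batch because NumPy clips out-of-range slice ends. The same pattern in isolation:

```python
import numpy as np

num_samples, num_classes, batch_size = 5, 3, 2
predictions = np.zeros((num_samples, num_classes), dtype=int)

start_index = 0
for batch in np.array_split(np.arange(num_samples), 3):  # batches of 2, 2 and 1 samples
    batch_predictions = np.ones((len(batch), num_classes), dtype=int)  # stand-in for thresholded logits
    # NumPy clips the slice end, so the last, smaller batch fits as well.
    predictions[start_index:start_index + batch_size, :num_classes] = batch_predictions
    start_index += batch_size
```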
diff --git a/src/python/evaluation/qodana/imitation_model/preprocessing/README.md b/src/python/evaluation/qodana/imitation_model/preprocessing/README.md
new file mode 100644
index 00000000..76e2b9d0
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/preprocessing/README.md
@@ -0,0 +1,57 @@
+# Data preprocessing
+
+This module transforms the filtered and labeled dataset into files that can be used as input
+for the [train](src/python/evaluation/qodana/imitation_model/train.py) and
+[evaluation](src/python/evaluation/qodana/imitation_model/evaluation.py) scripts.
+
+### Step 1
+
+Run the [fragment_to_inspections_list.py](https://github.com/hyperskill/hyperstyle/blob/roberta-model/src/python/evaluation/qodana/fragment_to_inspections_list.py)
+script to get the `numbered_ids.csv` file when working with code blocks, or alternatively run the
+[fragment_to_inspections_list_line_by_line.py](https://github.com/hyperskill/hyperstyle/blob/roberta-model/src/python/evaluation/qodana/fragment_to_inspections_list_line_by_line.py)
+script to get the `numbered_ids_line_by_line.csv` file.
+
+See the [detailed instructions](https://github.com/hyperskill/hyperstyle/tree/roberta-model/src/python/evaluation/qodana)
+on how to run these scripts.
+
+### Step 2
+
+Run [encode_data.py](https://github.com/hyperskill/hyperstyle/blob/roberta-model/src/python/model/preprocessing/encode_data.py) with the
+following arguments:
+
+Required arguments:
+
+`dataset_path` — path to the `numbered_ids_line_by_line.csv` or `numbered_ids.csv` file.
+
+Optional arguments:
+
+Argument | Description
+--- | ---
+|**‑o**, **‑‑output_file_path**| Path to the directory where the output file will be created. If not set, the output file will be saved in the parent directory of `dataset_path`.|
+|**‑ohe**, **‑‑one_hot_encoding**| If `True`, the target column will be represented as a one-hot-encoded vector. The length of each vector is equal to the number of unique classes in the dataset. Default is `True`.|
+|**‑c**, **‑‑add_context**| Should be used only when `dataset_path` points to `numbered_ids_line_by_line.csv`. If set to `True`, each single line will be substituted by a piece of code – the context created from several lines. Default is `False`.|
+|**‑n**, **‑‑n_lines_to_add**| The number of lines to append to the target line before and after it. A line is appended only if it belongs to the same solution. If there are not enough lines in the solution, a special token is appended instead. Default is 2.|
+
+
+#### Script functionality overview:
+- creates a `one-hot-encoding` vector for each sample in the dataset **(default)**;
+- substitutes `NaN` values in the dataset with the `\n` symbol **(default)**;
+- transforms lines of code into a `context` of several lines of code **(optional)**.
+
+### Step 3
+
+Run [`split_dataset.py`](https://github.com/hyperskill/hyperstyle/blob/roberta-model/src/python/model/preprocessing/split_dataset.py)
+with the following arguments:
+
+Required arguments:
+
+`dataset_path` — path to the `encoded_dataset.csv` file obtained by running the [encode_data.py](https://github.com/hyperskill/hyperstyle/blob/roberta-model/src/python/model/preprocessing/encode_data.py) script.
+
+Optional arguments:
+
+Argument | Description
+--- | ---
+|**‑d**, **‑‑output_directory_path**| Path to the directory where folders for the train, test and validation datasets with the corresponding files will be created. If not set, the folders will be created in the parent directory of `dataset_path`.|
+|**‑ts**, **‑‑test_size**| Proportion of the test dataset. Available values: 0 < n < 1. Default is 0.2.|
+|**‑vs**, **‑‑val_size**| Proportion of the validation dataset that will be taken from the train dataset. Available values: 0 < n < 1. Default is 0.3.|
+|**‑sh**, **‑‑shuffle**| If `True`, the data will be shuffled before the split. Default is `True`.|
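+
+Note that `val_size` is applied to what remains after the test split, so the default settings produce
+roughly a 56% / 24% / 20% train / validation / test split:
+
+```python
+from sklearn.model_selection import train_test_split
+
+samples = list(range(100))
+train, test = train_test_split(samples, test_size=0.2, random_state=42)  # 80 / 20
+train, val = train_test_split(train, test_size=0.3, random_state=42)     # 56 / 24
+print(len(train), len(val), len(test))  # 56 24 20
+```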
diff --git a/src/python/evaluation/qodana/imitation_model/preprocessing/__init__.py b/src/python/evaluation/qodana/imitation_model/preprocessing/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/python/evaluation/qodana/imitation_model/preprocessing/encode_data.py b/src/python/evaluation/qodana/imitation_model/preprocessing/encode_data.py
new file mode 100644
index 00000000..8b57888a
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/preprocessing/encode_data.py
@@ -0,0 +1,162 @@
+import argparse
+import logging
+import sys
+from itertools import chain
+from pathlib import Path
+from typing import List
+
+import numpy as np
+import pandas as pd
+from sklearn.preprocessing import MultiLabelBinarizer
+from src.python.evaluation.common.csv_util import write_dataframe_to_csv
+from src.python.evaluation.common.util import ColumnName
+from src.python.evaluation.qodana.imitation_model.common.util import CustomTokens, DatasetColumnArgument
+from src.python.review.common.file_system import Extension
+
+
+logger = logging.getLogger(__name__)
+sys.path.append('')
+sys.path.append('../../../../..')
+
+
+def configure_arguments(parser: argparse.ArgumentParser) -> None:
+    parser.add_argument('dataset_path',
+                        type=lambda value: Path(value).absolute(),
+                        help='Path to the dataset with the values to be encoded.')
+
+    parser.add_argument('-o', '--output_file_path',
+                        help='Output file path. If not set, the file will be saved to '
+                             'the input file parent directory.',
+                        type=str,
+                        default='input_file_directory')
+
+    parser.add_argument('-c', '--add_context',
+                        help='Use for datasets with code lines only. If enabled, '
+                             'n lines before and n lines after the target line will be added to each sample. '
+                             'Default is False.',
+                        action='store_true')
+
+    parser.add_argument('-n', '--n_lines_to_add',
+                        help='Use only if add_context is enabled. Adds n lines from the same piece of code '
+                             'before and after each line in the dataset. If there are no lines before or after a line '
+                             'from the same code sample, a special token will be added instead. Default is 2.',
+                        default=2,
+                        type=int)
+
+    parser.add_argument('-ohe', '--one_hot_encoding',
+                        help='If True, the target column will be represented as a one-hot-encoded vector. '
+                             'The length of each vector is equal to the number of unique classes. '
+                             'Default is True.',
+                        action='store_false')
+
+
+def __one_hot_encoding(df: pd.DataFrame) -> pd.DataFrame:
+    """ Transforms strings in the 'inspections' column,
+    denoting inspection ids, into n columns
+    with binary values:
+
+    1 x n_rows -> n_unique_classes x n_rows
+
+    where n_unique_classes is equal to the number
+    of unique inspections in the dataset.
+
+    Example:
+    inspections   ->   1  2  3
+    '1, 2'             1  1  0
+    '3'                0  0  1
+    """
+    target = df[DatasetColumnArgument.INSPECTIONS.value].to_numpy().astype(str)
+    target_list_int = [np.unique(tuple(map(int, label.split(',')))) for label in target]
+    try:
+        mlb = MultiLabelBinarizer()
+        encoded_target = mlb.fit_transform(target_list_int)
+        assert len(list(set(chain.from_iterable(target_list_int)))) == encoded_target.shape[1]
+        encoded_target = pd.DataFrame(data=encoded_target, columns=range(encoded_target.shape[1]))
+        return encoded_target
+    except AssertionError as e:
+        logger.error('encoded_target.shape[1] should be equal to the number of classes')
+        raise e
+
+
+class Context:
+    """ To each line of code add context from the same solution:
+    n_lines before the target line and n_lines after it.
+
+    If there are no lines before and/or after a piece of code,
+    special tokens are added instead.
+    """
+    def __init__(self, df: pd.DataFrame, n_lines: int):
+        self.indices = df[DatasetColumnArgument.ID.value].to_numpy()
+        self.lines = df[ColumnName.CODE.value]
+        self.n_lines: int = n_lines
+        self.df = df
+
+    def add_context_to_lines(self) -> pd.DataFrame:
+        lines_with_context = []
+        for current_line_index, current_line in enumerate(self.lines):
+            context = self.add_context_before(current_line_index, current_line)
+            context = self.add_context_after(context, current_line_index)
+            lines_with_context.append(context[0])
+        self.df[ColumnName.CODE.value] = pd.Series(lines_with_context)
+        return self.df
+
+    def add_context_before(self, current_line_index: int, current_line: str) -> List[str]:
+        """ Add n_lines lines before the target line from the same piece of code.
+        If there are fewer than n_lines lines above the target line,
+        a special token is added instead.
+        """
+        context_lines = []
+        for n_line_index in range(current_line_index - self.n_lines, current_line_index):
+            if n_line_index < 0 or self.indices[n_line_index] != self.indices[current_line_index]:
+                # the line is outside the dataset or belongs to another solution
+                context_lines.append(CustomTokens.NOC.value)
+            else:
+                context_lines.append(self.lines.iloc[n_line_index])
+        context_lines.append(current_line)
+        return ['\n'.join(context_lines)]
+
+    def add_context_after(self, context: List[str], current_line_index: int) -> List[str]:
+        """ Add n_lines lines after the target line from the same piece of code.
+        If there are fewer than n_lines lines after the target line,
+        a special token is added instead.
+        """
+        context_lines = [context[0]]
+        for n_line_index in range(current_line_index + 1, current_line_index + self.n_lines + 1):
+            if n_line_index >= len(self.lines) or self.indices[n_line_index] != self.indices[current_line_index]:
+                # the line is outside the dataset or belongs to another solution
+                context_lines.append(CustomTokens.NOC.value)
+            else:
+                context_lines.append(self.lines.iloc[n_line_index])
+        return ['\n'.join(context_lines)]
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    configure_arguments(parser)
+    args = parser.parse_args()
+
+    dataset_path = args.dataset_path
+    output_file_path = args.output_file_path
+
+    if output_file_path == 'input_file_directory':
+        output_file_path = Path(dataset_path).parent / f'encoded_dataset{Extension.CSV.value}'
+
+    # nan -> \n (empty rows)
+    df = pd.read_csv(dataset_path)
+    df[ColumnName.CODE.value].fillna('\n', inplace=True)
+
+    if args.one_hot_encoding:
+        target = __one_hot_encoding(df)
+        df = pd.concat([df[[ColumnName.ID.value, ColumnName.CODE.value]], target], axis=1)
+
+    if args.add_context:
+        df = Context(df, args.n_lines_to_add).add_context_to_lines()
+
+    write_dataframe_to_csv(output_file_path, df)
+
+
+if __name__ == '__main__':
+    main()
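To make the windowing concrete, here is a sketch of the intended behavior for a three-line solution with one line of context on each side (hypothetical values; assumes the corrected `Context` implementation above and the repository root on `PYTHONPATH`):

```python
import pandas as pd

from src.python.evaluation.common.util import ColumnName
from src.python.evaluation.qodana.imitation_model.preprocessing.encode_data import Context

# Three lines belonging to the same solution (id 7).
df = pd.DataFrame({
    'id': [7, 7, 7],  # DatasetColumnArgument.ID.value
    ColumnName.CODE.value: ['int a = 0;', 'a += 1;', 'return a;'],
})

# With one line of context on each side, boundary lines are padded with [NOC].
print(Context(df, n_lines=1).add_context_to_lines()[ColumnName.CODE.value].tolist())
# ['[NOC]\nint a = 0;\na += 1;',
#  'int a = 0;\na += 1;\nreturn a;',
#  'a += 1;\nreturn a;\n[NOC]']
```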
diff --git a/src/python/evaluation/qodana/imitation_model/preprocessing/split_dataset.py b/src/python/evaluation/qodana/imitation_model/preprocessing/split_dataset.py
new file mode 100644
index 00000000..41a4319b
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/preprocessing/split_dataset.py
@@ -0,0 +1,77 @@
+import argparse
+import os
+from pathlib import Path
+
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from src.python.evaluation.common.csv_util import write_dataframe_to_csv
+from src.python.evaluation.common.util import ColumnName
+from src.python.evaluation.qodana.imitation_model.common.util import SeedArgument
+from src.python.review.common.file_system import Extension
+
+
+def configure_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser()
+    parser.add_argument('dataset_path',
+                        type=str,
+                        help='Path to the dataset received by either the'
+                             f' src.python.evaluation.qodana.fragment_to_inspections_list{Extension.PY.value}'
+                             f' or the src.python.evaluation.qodana.fragment_to_inspections_list_line_by_line'
+                             f'{Extension.PY.value} script.')
+
+    parser.add_argument('-d', '--output_directory_path',
+                        type=str,
+                        default=None,
+                        help='Path to the directory where folders for the train, test and validation datasets will be '
+                             'created. If not set, the directories will be created in the parent directory '
+                             'of dataset_path.')
+
+    parser.add_argument('-ts', '--test_size',
+                        type=float,
+                        default=0.2,
+                        help='Proportion of the test dataset. Default is 0.2.')
+
+    parser.add_argument('-vs', '--val_size',
+                        type=float,
+                        default=0.3,
+                        help='Proportion of the validation dataset taken from the train dataset. Default is 0.3.')
+
+    parser.add_argument('-sh', '--shuffle',
+                        # bool('False') is True, so parse the flag value explicitly
+                        type=lambda flag: flag.lower() == 'true',
+                        default=True,
+                        help='If True, the data will be shuffled before splitting. Default is True.')
+
+    return parser
+
+
+def split_dataset(dataset_path: str, output_directory_path: str, val_size: float, test_size: float, shuffle: bool):
+    df = pd.read_csv(dataset_path)
+    target = df.iloc[:, 2:]
+    code_bank = df[ColumnName.CODE.value]
+
+    code_train, code_test, target_train, target_test = train_test_split(code_bank,
+                                                                        target,
+                                                                        test_size=test_size,
+                                                                        random_state=SeedArgument.SEED.value,
+                                                                        shuffle=shuffle)
+
+    code_train, code_val, target_train, target_val = train_test_split(code_train,
+                                                                      target_train,
+                                                                      test_size=val_size,
+                                                                      random_state=SeedArgument.SEED.value,
+                                                                      shuffle=shuffle)
+    if output_directory_path is None:
+        output_directory_path = Path(dataset_path).parent
+
+    for holdout in [("train", code_train, target_train),
+                    ("val", code_val, target_val),
+                    ("test", code_test, target_test)]:
+        df = pd.concat([holdout[1], holdout[2]], axis=1)
+        os.makedirs(os.path.join(output_directory_path, holdout[0]), exist_ok=True)
+        write_dataframe_to_csv(Path(output_directory_path) / holdout[0] / f'{holdout[0]}{Extension.CSV.value}', df)
+
+
+if __name__ == "__main__":
+    parser = configure_parser()
+    args = parser.parse_args()
+
+    split_dataset(args.dataset_path, args.output_directory_path, args.val_size, args.test_size, args.shuffle)
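For reference, the equivalent programmatic call (hypothetical input path; passing `None` keeps the default output location next to the dataset):

```python
from src.python.evaluation.qodana.imitation_model.preprocessing.split_dataset import split_dataset

# Equivalent to: python split_dataset.py encoded_dataset.csv -ts 0.2 -vs 0.3
split_dataset('encoded_dataset.csv', output_directory_path=None,
              val_size=0.3, test_size=0.2, shuffle=True)
```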
diff --git a/src/python/evaluation/qodana/imitation_model/train.py b/src/python/evaluation/qodana/imitation_model/train.py
new file mode 100644
index 00000000..fc458d95
--- /dev/null
+++ b/src/python/evaluation/qodana/imitation_model/train.py
@@ -0,0 +1,46 @@
+import argparse
+import os
+import sys
+from pathlib import Path
+
+import torch
+from src.python.evaluation.qodana.imitation_model.common.metric import Measurer
+from src.python.evaluation.qodana.imitation_model.common.train_config import (
+    configure_arguments, MultilabelTrainer, TrainingArgs,
+)
+from src.python.evaluation.qodana.imitation_model.common.util import DatasetColumnArgument
+from src.python.evaluation.qodana.imitation_model.dataset.dataset import QodanaDataset
+from transformers import RobertaForSequenceClassification
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    configure_arguments(parser)
+    args = parser.parse_args()
+    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    train_dataset = QodanaDataset(args.train_dataset_path, args.context_length)
+    val_dataset = QodanaDataset(args.val_dataset_path, args.context_length)
+    train_steps_to_be_made = len(train_dataset) // args.batch_size
+    val_steps_to_be_made = train_steps_to_be_made // 5
+    print(f'Steps to be made: {train_steps_to_be_made}, validate each {val_steps_to_be_made}th step.')
+
+    num_labels = train_dataset[0][DatasetColumnArgument.LABELS.value].shape[0]
+    model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels).to(device)
+
+    metrics = Measurer(args.threshold)
+    if args.trained_weights_directory_path is None:
+        args.trained_weights_directory_path = Path(args.train_dataset_path).parent / DatasetColumnArgument.WEIGHTS.value
+        os.makedirs(args.trained_weights_directory_path, exist_ok=True)
+
+    train_args = TrainingArgs(args)
+
+    trainer = MultilabelTrainer(model=model,
+                                args=train_args.get_training_args(val_steps_to_be_made),
+                                train_dataset=train_dataset,
+                                eval_dataset=val_dataset,
+                                compute_metrics=metrics.compute_metric)
+    trainer.train()
+
+
+if __name__ == '__main__':
+    sys.exit(main())
diff --git a/whitelist.txt b/whitelist.txt
index d5fed167..45928058 100644
--- a/whitelist.txt
+++ b/whitelist.txt
@@ -143,4 +143,8 @@ idx
 QodanaDataset
 cuda
 f1
-WANDB
\ No newline at end of file
+WANDB
+PNG
+consts
+Measurer
+ndarray
\ No newline at end of file