Qodana simulation model #40
Merged
Changes from all commits (136 commits)
53dbe88
Merge branch 'main-upd' into develop
nbirillo 16937ba
Add possibility to evaluate csv files
nbirillo 136ed89
Filter solutions by language
nbirillo 78e7201
Optimize evaluation script, fix tests
nbirillo 48e3408
Add readme, add distribute grades
nbirillo 80ea546
Added project template for Java
GirZ0n db30732
Added DatasetMarker
GirZ0n 995aaea
Add tests
nbirillo 30859c0
Add diffs finder between two dfs
nbirillo f21cdf8
Delete unnecessary files
nbirillo 8c85e36
Code refactoring
GirZ0n b1fa21b
Added some words
GirZ0n 216aef4
Merge branch 'qodana' into roberta-model
dariadiatlova ee9e4fe
merge prep
dariadiatlova 61d7199
fixed merge conflicts
dariadiatlova a56eeae
Fixed merge conflict
dariadiatlova 5929f46
Small code refactoring
GirZ0n 1a34b50
Added new requirements
GirZ0n ecbcdf4
Added ID to ColumnName
GirZ0n 92f9a8a
Added README.md
GirZ0n 3ef6a42
Added default value for --chunk-size
GirZ0n 3f028bd
Merge remote-tracking branch 'origin/qodana' into qodana
GirZ0n 7cd79e0
parse qodana output
nbirillo 5c4b85a
Merge remote-tracking branch 'origin/develop' into develop
nbirillo ec6b477
Merge branch 'develop' into qodana
nbirillo a8b80c0
Update README.md
GirZ0n 75cac7b
Change qodana scipt output
nbirillo dd9d502
Merge remote-tracking branch 'origin/develop' into qodana
GirZ0n 6915254
Merge remote-tracking branch 'origin/qodana' into qodana
nbirillo f0c098b
Merge branch 'qodana' into fix/qodana-output
nbirillo c428b78
Fix a bug with qodana
nbirillo 8489c9d
Fix a bug with path to the gradle project
nbirillo 395cc6f
Add a script for filtering inspections
nbirillo 371f985
Fixed PR issues
GirZ0n 0dab1b7
Fix/qodana output (#33)
nbirillo f9b418d
Added is_java function
GirZ0n 96c0518
1) Added copy_directory and copy_file functions;
GirZ0n 38a936a
Removed python_on_whales dependency
GirZ0n 235e60f
Fixed some PR issues
GirZ0n 5faa46c
Merge branch 'fix/qodana-output' into qodana
GirZ0n e7c01a9
Merge branch 'qodana' into qodana-handlers
GirZ0n 082ca6d
Qoadana handlers/get unique inspections (#35)
nbirillo e413678
Update README.md
dariadiatlova 2b054f8
Added train, evaluationa and dataset preprocessing script for dataset
dariadiatlova f6b869d
resolve merge conflicts
dariadiatlova f1e7961
resolve merge conflicts
dariadiatlova 57d796b
fix merge conflicts
dariadiatlova 6b9f493
updates req and whitelist
dariadiatlova 9de5a23
added option to set parameter in encode_data script
dariadiatlova 3b82f8d
improved dataset class
dariadiatlova b41531d
updated requirenments
dariadiatlova 979a918
updated whitelist
dariadiatlova ef62341
changed acc on f1 score and refactoring
dariadiatlova cff3619
updated whitelist
dariadiatlova 26106e9
fixed typing and styles
dariadiatlova 9902efb
fixed style
dariadiatlova cfa2bd4
fixed style
dariadiatlova 521b168
fixed dataset type
dariadiatlova cdb68ca
removed extra to(device)
dariadiatlova 09a51db
small architecture changes
dariadiatlova 6b438b8
fixed styles
dariadiatlova dfa44fa
small changes in help section
dariadiatlova e56d9c5
fixed help section and added readme
dariadiatlova 1ac4408
Update README.md
dariadiatlova 1681db2
small fixe in the help section of train config
dariadiatlova 7e6901e
small fix – added short name in train config
dariadiatlova 363ccd0
changed full name
dariadiatlova 84e7dc8
Variable names refactoring
dariadiatlova 7cd3bb0
Added README
dariadiatlova 4787f3e
Update README.md
dariadiatlova 17bec35
Update README.md
dariadiatlova 31b9c7f
Update README.md
dariadiatlova d57938a
Update README.md
dariadiatlova f2e9d18
Variable name refactoring
dariadiatlova 7caa477
Merge remote-tracking branch 'origin/roberta-model' into roberta-model
dariadiatlova d66001c
Update README.md
dariadiatlova 9643a38
Update README.md
dariadiatlova c0a9488
Update evaluation_config.py
dariadiatlova c21dec0
fixed gpu usage
dariadiatlova 4b379ba
fixed label choice while computing metric
dariadiatlova aa20d14
updated metric class instance
dariadiatlova 4b634a4
updated f1-score computation
dariadiatlova 30188d4
updated metric
dariadiatlova 7e6d6a8
fixed df shape
685c894
debugging - small fix
dariadiatlova e339944
small fix - debugging
dariadiatlova bb554f6
merge preparation
dariadiatlova 21abcdc
fix merge conflicts
dariadiatlova 94a3689
updated whitelist
dariadiatlova 28734ae
Merge branch 'develop' into roberta-model
dariadiatlova ecaa988
Merge branch 'develop' into roberta-model
nbirillo 982faa6
Fix merge conflicts
nbirillo c682448
moved model folder and added links to README.md
dariadiatlova 87c13c7
Merge remote-tracking branch 'origin/roberta-model' into roberta-model
dariadiatlova f76d52b
fixed merge conflicts
dariadiatlova c77e9c3
fixed line length
dariadiatlova e230057
added trailing comma
dariadiatlova 350d171
added description section to the readme.md
dariadiatlova 28d40a5
added description for batch_size and num_classes
dariadiatlova b0700e0
updated paths in readme
dariadiatlova 555d338
added ModelCommonArguments class
dariadiatlova 6a51681
fixed styles
dariadiatlova f6e3e74
split evaluation script into 2 functions
dariadiatlova a2b7a0d
added import for typing
dariadiatlova de3fdb0
extended whitelist
dariadiatlova 6b999f4
resolved arguments naming conflict
dariadiatlova 298fdc3
changed dtype in dataset dataframe
dariadiatlova 2f5cf3c
fixed type issue while reading dataset
dariadiatlova f3786f4
added examples to the readme
dariadiatlova 2baf137
extended description in readme.md
dariadiatlova 72ed326
added description to the MultilabelTrainer class
dariadiatlova 0e7be42
added description to the preprocessing module
dariadiatlova fcbf01d
added constant variable names for Measurer class
dariadiatlova 7be1da6
explained one-hot-encoding
dariadiatlova 98acc20
added docs to the context functions and specified return type
dariadiatlova 0f2128e
typo refactoring
dariadiatlova 92dc75c
added option to compute f1-score by classes
dariadiatlova 1b816be
custom metric debugging
dariadiatlova af0d8fb
custom metric debugging
dariadiatlova c1b496e
custom metric debugging
dariadiatlova 8a68089
custom metric debugging
dariadiatlova 473d623
resolve merge conflicts
dariadiatlova 2074e24
fixed line length
dariadiatlova 29cbb4d
added logger for f1 score
dariadiatlova 7200b11
added description
dariadiatlova fcc68d5
added new line import
dariadiatlova c4e2e54
added new return type
dariadiatlova ddee437
added option to save f1 scores to the file
dariadiatlova 20e9302
added description of the save metric option to the README.md
dariadiatlova 928e766
fixed file names for report
dariadiatlova 6458799
Update README.md
dariadiatlova 9ca35ac
added column with inspections indices
dariadiatlova 7e2ba8c
delete unnecessary requirements.txt
dariadiatlova f686b5c
fixed typing
dariadiatlova f244e41
name refactoring
dariadiatlova 5b7de2a
created separate class for Seed value
dariadiatlova
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
New file (+6 lines):

```
tqdm==4.49.0
scikit-learn~=0.24.2
transformers==4.6.1
tokenizers==0.10.2
torch==1.8.1
wandb==0.10.31
```
New file (+118 lines):

# Qodana imitation model

## Description

The general purpose of the model is to simulate the behavior of [`Qodana`](https://github.com/JetBrains/Qodana/tree/main), a code quality monitoring tool that identifies and suggests fixes for bugs, security vulnerabilities, duplications, and imperfections.

Motivation for developing the model:

- acceleration of the code analysis process by training the model to recognize a certain class of errors;
- the ability to run the model on separate files without the need to create a project (for example, for the Java language).

## Architecture

A [`RobertaForSequenceClassification`](https://huggingface.co/transformers/model_doc/roberta.html#robertaforsequenceclassification) model with [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) solves the multilabel classification task.

The model output is a tensor of size `batch_size` x `num_classes`, where `batch_size` is the number of training examples utilized in one iteration and `num_classes` is the number of error types found in the dataset. By a model class, we mean a unique error type. Class probabilities are obtained by applying a `sigmoid`, and final predictions are computed by comparing the probability of each class with a `threshold`.

Since the classes might be unbalanced, the metric used is the `f1-score`.
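The sigmoid-plus-threshold step described above can be sketched without any framework; `predict_inspections` is a hypothetical helper for illustration, not part of the repository:

```python
import math


def predict_inspections(logits, threshold=0.5):
    """Turn one sample's raw logits (one value per class) into binary inspection labels."""
    probabilities = [1 / (1 + math.exp(-logit)) for logit in logits]
    return [1 if p > threshold else 0 for p in probabilities]


# Four classes: with threshold 0.5, exactly the classes with positive logits are flagged.
print(predict_inspections([2.3, -1.7, 0.1, -0.2]))  # [1, 0, 1, 0]
```

Raising the threshold makes the model more conservative: with `threshold=0.95` none of these four logits would be flagged.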
## What it does

The model has two use cases:

- It can be trained to predict the set of unique errors in a **block** of code of unfixed length.

**Example**:

code | inspections
--- | ---
|`import java.util.Scanner; class Main {public static void main(String[] args) {Scanner scanner = new Scanner(System.in);// put your code here int num = scanner.nextInt(); System.out.println((num / 10 ) % 10);}}`| 1, 2|

- It can be trained to predict the set of unique errors in a **line** of code.

**Example**:

code | inspections
--- | ---
|`import java.util.Scanner;`| 0|
|`\n`|0|
|`class Main {`|1|
|`public static void main(String[] args) {`|1|
|`Scanner scanner = new Scanner(System.in);`|0|
|`// put your code here`|0|
|`int num = scanner.nextInt();`|0|
|`System.out.println((num / 10 ) % 10);`|2|
|`}`|0|
|`}`|0|
## Data preprocessing

Please refer to the [`following documentation`](src/python/evaluation/qodana) to label the dataset, and to the [`following documentation`](preprocessing) to preprocess the data for model training and evaluation afterwards.

After completing the third preprocessing step you should have 3 folders, `train`, `val`, and `test`, with `train.csv`, `val.csv`, and `test.csv` respectively.

Each file has the same structure and should consist of 4+ columns:

- `id` – solution id;
- `code` – line of code or block of code;
- `lang` – language version;
- `0`, `1`, `2` ... `n` – one column per unique error detected by Qodana in the dataset. The values in these columns are binary: `1` if the inspection is detected and `0` otherwise.
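The column layout can be illustrated with a tiny synthetic file; the row contents below are made up for the sketch and the inspection columns assume `n = 3` unique errors:

```python
import csv
import io

# Hypothetical preprocessed-file structure: 3 fixed columns plus one
# binary column per unique Qodana inspection found in the dataset.
header = ['id', 'code', 'lang', '0', '1', '2']
rows = [
    ['1', 'class Main {', 'java11', '0', '1', '0'],
    ['2', 'int num = scanner.nextInt();', 'java11', '0', '0', '0'],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue().splitlines()[0])  # id,code,lang,0,1,2
```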
## How to train the model

Run the [`train.py`](train.py) script from the command line with the following arguments.

Required arguments:

- `train_dataset_path` ‑ path to `train.csv`, the file with the samples the model will use for training.
- `val_dataset_path` ‑ path to `val.csv`, the file with the samples the model will use for evaluation during training.

Both files are produced by the [`split_dataset.py`](preprocessing/split_dataset.py) script and have the structure described above.

Optional arguments:

Argument | Description
--- | ---
|**‑o**, **‑‑output_directory_path**| Path to the directory where model weights will be saved. If not set, a folder will be created in the `train` folder where the `train.csv` dataset is stored.|
|**‑c**, **‑‑context_length**| Sequence length or embedding size of tokenized samples. Any `positive integer`. **Default is 40**.|
|**‑e**, **‑‑epoch**| Number of epochs to train the model. **Default is 2**.|
|**‑bs**, **‑‑batch_size**| Batch size for the training and validation datasets. Any `positive integer`. **Default is 16**.|
|**‑lr**, **‑‑learning_rate**| Optimizer learning rate. **Default is 2e-5**.|
|**‑w**, **‑‑weight_decay**| Weight decay parameter for the optimizer. **Default is 0.01**.|
|**‑th**, **‑‑threshold**| Used to compute predictions. Any value with 0 < `threshold` < 1. If the probability of an inspection is greater than `threshold`, the sample will be classified with that inspection. **Default is 0.5**.|
|**‑ws**, **‑‑warm_up_steps**| The number of steps during which the optimizer uses a constant learning rate before applying the scheduler policy. **Default is 300**.|
|**‑sl**, **‑‑save_limit**| Maximum number of checkpoints to keep. **Default is 1**.|

To inspect the rest of the default training parameters, please refer to [`TrainingArguments`](common/train_config.py).
## How to evaluate the model

Run the [`evaluation.py`](evaluation.py) script from the command line with the following arguments.

Required arguments:

- `test_dataset_path` ‑ path to `test.csv`, produced by the [`split_dataset.py`](preprocessing/split_dataset.py) script.
- `model_weights_directory_path` ‑ path to the folder where the trained model weights are saved.

Optional arguments:

Argument | Description
--- | ---
|**‑o**, **‑‑output_directory_path**| Path to the directory where the labeled dataset will be saved. Default is the `test` folder.|
|**‑c**, **‑‑context_length**| Sequence length or embedding size of tokenized samples. Any `positive integer`. **Default is 40**.|
|**‑sf**, **‑‑save_f1_score**| If enabled, a report with f1 scores by class is saved to a `csv` file in the parent directory of the labeled dataset. **Disabled by default**.|
|**‑bs**, **‑‑batch_size**| The number of examples utilized in one evaluation iteration. Any `positive integer`. **Default is 16**.|
|**‑th**, **‑‑threshold**| Used to compute predictions. Any value with 0 < `threshold` < 1. If the probability of an inspection is greater than `threshold`, the sample will be classified with that inspection. **Default is 0.5**.|

The output is a `predictions.csv` file whose column names match the classes. Each sample has a binary label:

- `0` ‑ if the model didn't find an error in the sample;
- `1` ‑ if an error was found in the sample.
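Reading such a binary-labeled file back into per-sample inspection lists is a one-liner; the rows below are invented for the sketch:

```python
# Hypothetical post-processing of predictions.csv: collect, for each sample,
# the class columns labeled 1 (the inspections the model predicted).
rows = [
    {'0': 0, '1': 1, '2': 1},  # sample flagged with inspections 1 and 2
    {'0': 0, '1': 0, '2': 0},  # clean sample
]
predicted = [[name for name, label in row.items() if label == 1] for row in rows]
print(predicted)  # [['1', '2'], []]
```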
New file (+3 lines):

```python
from src.python import MAIN_FOLDER

MODEL_FOLDER = MAIN_FOLDER.parent / 'python/imitation_model'
```
Empty file.
src/python/evaluation/qodana/imitation_model/common/evaluation_config.py (47 additions, 0 deletions):

```python
import argparse

from src.python.evaluation.qodana.imitation_model.common.util import ModelCommonArgument
from src.python.review.common.file_system import Extension


def configure_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument('test_dataset_path',
                        type=str,
                        help='Path to the dataset received by either the'
                             f' src.python.evaluation.qodana.fragment_to_inspections_list{Extension.PY.value}'
                             ' or the src.python.evaluation.qodana.fragment_to_inspections_list_line_by_line'
                             f'{Extension.PY.value} script.')

    parser.add_argument('model_weights_directory_path',
                        type=str,
                        help='Path to the directory where trained imitation_model weights are stored.')

    parser.add_argument('-o', '--output_directory_path',
                        default=None,
                        type=str,
                        help='Path to the directory where the labeled dataset will be saved. Default is the parent'
                             ' folder of test_dataset_path.')

    parser.add_argument('-sf', '--save_f1_score',
                        default=None,
                        action='store_true',
                        help=f'If enabled, a report with f1 scores by class will be saved to a {Extension.CSV.value}'
                             ' file in the labeled dataset parent directory. Default is False.')

    parser.add_argument(ModelCommonArgument.CONTEXT_LENGTH.value.short_name,
                        ModelCommonArgument.CONTEXT_LENGTH.value.long_name,
                        type=int,
                        default=40,
                        help=ModelCommonArgument.CONTEXT_LENGTH.value.description)

    parser.add_argument(ModelCommonArgument.BATCH_SIZE.value.short_name,
                        ModelCommonArgument.BATCH_SIZE.value.long_name,
                        type=int,
                        default=8,
                        help=ModelCommonArgument.BATCH_SIZE.value.description)

    parser.add_argument(ModelCommonArgument.THRESHOLD.value.short_name,
                        ModelCommonArgument.THRESHOLD.value.long_name,
                        type=float,
                        default=0.5,
                        help=ModelCommonArgument.THRESHOLD.value.description)
```
src/python/evaluation/qodana/imitation_model/common/metric.py (41 additions, 0 deletions):

```python
import logging.config
from typing import Optional

import torch
from sklearn.metrics import multilabel_confusion_matrix
from src.python.evaluation.qodana.imitation_model.common.util import MeasurerArgument

logger = logging.getLogger(__name__)


class Measurer:
    def __init__(self, threshold: float):
        self.threshold = threshold

    def get_f1_score(self, predictions: torch.Tensor, targets: torch.Tensor) -> Optional[float]:
        # Each entry of the multilabel confusion matrix is [[tn, fp], [fn, tp]].
        confusion_matrix = multilabel_confusion_matrix(targets, predictions)
        false_positives = sum(score[0][1] for score in confusion_matrix)
        false_negatives = sum(score[1][0] for score in confusion_matrix)
        true_positives = sum(score[1][1] for score in confusion_matrix)
        try:
            return true_positives / (true_positives + 1 / 2 * (false_positives + false_negatives))
        except ZeroDivisionError:
            logger.error('No values of the class present in the dataset.')
            # Return None to make it clear after printing which classes are missing in the datasets.
            return None

    def compute_metric(self, evaluation_predictions) -> dict:
        logits, targets = evaluation_predictions
        prediction_probabilities = torch.from_numpy(logits).sigmoid()
        predictions = torch.where(prediction_probabilities > self.threshold, 1, 0)
        return {MeasurerArgument.F1_SCORE.value: self.get_f1_score(predictions, torch.tensor(targets))}

    def f1_score_by_classes(self, predictions: torch.Tensor, targets: torch.Tensor) -> dict:
        unique_classes = range(len(targets[0]))
        f1_scores_by_classes = {}
        for unique_class in unique_classes:
            class_mask = torch.where(targets[:, unique_class] == 1)
            f1_scores_by_classes[str(unique_class)] = self.get_f1_score(predictions[class_mask[0], unique_class],
                                                                        targets[class_mask[0], unique_class])
        return f1_scores_by_classes
```
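`get_f1_score` above aggregates true/false positives and negatives across all classes before applying the micro-F1 formula TP / (TP + (FP + FN) / 2). A dependency-free sketch of that formula, with `micro_f1` as a hypothetical helper operating on flattened (sample, class) label lists:

```python
def micro_f1(predictions, targets):
    """Micro-averaged F1 over flattened binary labels: TP / (TP + (FP + FN) / 2)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(predictions, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(predictions, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(predictions, targets))
    return tp / (tp + (fp + fn) / 2)


# 3 true positives, 1 false positive, 0 false negatives: 3 / 3.5
print(round(micro_f1([1, 1, 1, 1, 0], [1, 1, 1, 0, 0]), 3))  # 0.857
```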
src/python/evaluation/qodana/imitation_model/common/train_config.py (118 additions, 0 deletions):

```python
import argparse

import torch
from src.python.evaluation.qodana.imitation_model.common.util import (
    DatasetColumnArgument,
    ModelCommonArgument,
    SeedArgument,
)
from transformers import Trainer, TrainingArguments


class MultilabelTrainer(Trainer):
    """By default, RobertaForSequenceClassification does not support
    multi-label classification.

    Target and logits tensors should be represented as torch.FloatTensor of shape (1,).
    https://huggingface.co/transformers/model_doc/roberta.html#transformers.RobertaForSequenceClassification

    To fine-tune the model for the multi-label classification task, we can simply modify the trainer by
    changing its loss function. https://huggingface.co/transformers/main_classes/trainer.html
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop(DatasetColumnArgument.LABELS.value)
        outputs = model(**inputs)
        logits = outputs.logits
        loss_bce = torch.nn.BCEWithLogitsLoss()
        loss = loss_bce(logits.view(-1, self.model.config.num_labels),
                        labels.float().view(-1, self.model.config.num_labels))

        return (loss, outputs) if return_outputs else loss


def configure_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument('train_dataset_path',
                        type=str,
                        help='Path to the train dataset.')

    parser.add_argument('val_dataset_path',
                        type=str,
                        help='Path to the validation dataset.')

    parser.add_argument('-wp', '--trained_weights_directory_path',
                        default=None,
                        type=str,
                        help='Path to the directory where to save imitation_model weights. Default is the directory'
                             ' where the train dataset is.')

    parser.add_argument(ModelCommonArgument.CONTEXT_LENGTH.value.short_name,
                        ModelCommonArgument.CONTEXT_LENGTH.value.long_name,
                        type=int,
                        default=40,
                        help=ModelCommonArgument.CONTEXT_LENGTH.value.description)

    parser.add_argument(ModelCommonArgument.BATCH_SIZE.value.short_name,
                        ModelCommonArgument.BATCH_SIZE.value.long_name,
                        type=int,
                        default=16,
                        help=ModelCommonArgument.BATCH_SIZE.value.description)

    parser.add_argument(ModelCommonArgument.THRESHOLD.value.short_name,
                        ModelCommonArgument.THRESHOLD.value.long_name,
                        type=float,
                        default=0.5,
                        help=ModelCommonArgument.THRESHOLD.value.description)

    parser.add_argument('-lr', '--learning_rate',
                        type=float,  # was `int`; the default 2e-5 is a float
                        default=2e-5,
                        help='Learning rate.')

    parser.add_argument('-wd', '--weight_decay',
                        type=float,  # was `int`; the default 0.01 is a float
                        default=0.01,
                        help='Weight decay parameter for the optimizer.')

    parser.add_argument('-e', '--epoch',
                        type=int,
                        default=1,
                        help='Number of epochs to train the imitation_model.')

    parser.add_argument('-ws', '--warm_up_steps',
                        type=int,
                        default=300,
                        help='Number of steps used for a linear warmup, default is 300.')

    parser.add_argument('-sl', '--save_limit',
                        type=int,
                        default=1,
                        help='Maximum number of checkpoints to keep. Default is 1.')


class TrainingArgs:
    def __init__(self, args):
        self.args = args

    def get_training_args(self, val_steps_to_be_made):
        return TrainingArguments(num_train_epochs=self.args.epoch,
                                 per_device_train_batch_size=self.args.batch_size,
                                 per_device_eval_batch_size=self.args.batch_size,
                                 learning_rate=self.args.learning_rate,
                                 warmup_steps=self.args.warm_up_steps,
                                 weight_decay=self.args.weight_decay,
                                 save_total_limit=self.args.save_limit,
                                 output_dir=self.args.trained_weights_directory_path,
                                 overwrite_output_dir=True,
                                 load_best_model_at_end=True,
                                 greater_is_better=True,
                                 save_steps=val_steps_to_be_made,
                                 eval_steps=val_steps_to_be_made,
                                 logging_steps=val_steps_to_be_made,
                                 evaluation_strategy=DatasetColumnArgument.STEPS.value,
                                 logging_strategy=DatasetColumnArgument.STEPS.value,
                                 seed=SeedArgument.SEED.value,
                                 report_to=[DatasetColumnArgument.WANDB.value])
```
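The loss that `MultilabelTrainer.compute_loss` swaps in treats each class as an independent binary decision. A dependency-free sketch of the same quantity, using the numerically stable formulation `max(x, 0) - x * t + log(1 + exp(-|x|))` over flattened logit/target pairs (`bce_with_logits` is a hypothetical helper, not repository code):

```python
import math


def bce_with_logits(logits, targets):
    """Binary cross-entropy on raw logits, averaged over all (sample, class) entries,
    mirroring what torch.nn.BCEWithLogitsLoss computes for the multilabel head."""
    losses = [max(x, 0) - x * t + math.log(1 + math.exp(-abs(x)))
              for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)


# A logit of 0 means probability 0.5, so the per-entry loss is ln(2) for either target.
print(round(bce_with_logits([0.0, 0.0], [1.0, 0.0]), 4))  # 0.6931
```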
src/python/evaluation/qodana/imitation_model/common/util.py (46 additions, 0 deletions):

```python
from enum import Enum, unique

from src.python.common.tool_arguments import ArgumentsInfo


@unique
class DatasetColumnArgument(Enum):
    ID = 'id'
    IN_ID = 'inspection_id'
    INSPECTIONS = 'inspections'
    INPUT_IDS = 'input_ids'
    LABELS = 'labels'
    DATASET_PATH = 'dataset_path'
    STEPS = 'steps'
    WEIGHTS = 'weights'
    WANDB = 'wandb'


@unique
class SeedArgument(Enum):
    SEED = 42


@unique
class CustomTokens(Enum):
    NOC = '[NOC]'  # "no context" token to add when there are no lines for the context


@unique
class ModelCommonArgument(Enum):
    THRESHOLD = ArgumentsInfo('-th', '--threshold',
                              'If the probability of an inspection on a code sample is greater than the threshold,'
                              ' the inspection id will be assigned to the sample. '
                              'Default is 0.5.')

    CONTEXT_LENGTH = ArgumentsInfo('-cl', '--context_length',
                                   'Sequence length of 1 sample after tokenization, default is 40.')

    BATCH_SIZE = ArgumentsInfo('-bs', '--batch_size',
                               'Batch size – default values are 16 for training and 8 for evaluation mode.')


@unique
class MeasurerArgument(Enum):
    F1_SCORE = 'f1_score'
    F1_SCORES_BY_CLS = 'f1_scores_by_class'
```
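The `@unique` Enum pattern used throughout `util.py` keeps argument and column names in one place; a minimal self-contained stand-in (the `Column` class here is illustrative, not repository code):

```python
from enum import Enum, unique


# @unique rejects duplicate values at class-creation time, and .value is
# what the argparse configuration and dataset code read.
@unique
class Column(Enum):
    ID = 'id'
    LABELS = 'labels'


print(Column.LABELS.value)  # labels
```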
Empty file.