This tool evaluates the performance of Large Language Models (LLMs) using various metrics to quantify the similarity and quality of generated texts compared to reference texts. It supports multiple metrics, including Jaccard Index, cosine similarity, BERTScore, and machine translation metrics, tailored to compare biomedical or domain-specific texts.
- Calculates a range of evaluation metrics, including keyword-based, similarity-based, and model-based scoring methods.
- Supports a custom keyword dictionary to adjust keyword sensitivity.
- Accepts input files in CSV or JSON format.
- Saves metric results in a user-defined JSON output file.
| Argument | Description | Required |
|---|---|---|
| `input_file` | Path to the input file (CSV or JSON) containing the texts to evaluate. | Yes |
| `output_dir` | Path to the output directory where the computed metrics will be stored. | Yes |
| `dict_path` | Path to a custom JSON keyword dictionary for keyword-based metrics. Defaults to the PALGA thesaurus. | No |
| `metrics` | List of metrics to compute. Available options: `jaccard_index_dictionary`, `jaccard_index_scispacy`, `cosine_similarity_biobert`, `machine_translation_metrics`, `bert_score_metric`. Defaults to all metrics if not specified. | No |
The input file should contain two columns labeled original and generated. The original column contains reference texts, and the generated column contains the corresponding texts generated by the LLM. The input file can be in CSV or JSON format.
CSV:

```csv
original,generated
original_text1,generated_text1
original_text2,generated_text2
```

JSON:

```json
{
    "original": ["original_text1", "original_text2"],
    "generated": ["generated_text1", "generated_text2"]
}
```
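Both formats reduce to the same pair of parallel columns, so loading can be sketched as follows (a minimal sketch using only the standard library; the function name `load_pairs` is hypothetical and the tool's actual loader may differ):

```python
import csv
import json
from pathlib import Path

def load_pairs(path: str) -> list[tuple[str, str]]:
    """Load (original, generated) text pairs from a CSV or JSON input file."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".json"):
        data = json.loads(text)
        # JSON schema: {"original": [...], "generated": [...]}
        return list(zip(data["original"], data["generated"]))
    # CSV schema: header row "original,generated", one pair per row
    rows = csv.DictReader(text.splitlines())
    return [(row["original"], row["generated"]) for row in rows]
```

Either way, each returned tuple is one reference/generation pair to be scored by the selected metrics.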
The following metrics are available:

- `jaccard_index_dictionary`: Calculates the Jaccard Index based on a custom dictionary of keywords.
- `jaccard_index_scispacy`: Uses the SciSpacy model to calculate the Jaccard Index, focusing on biomedical entities.
- `cosine_similarity_biobert`: Measures cosine similarity using BioBERT embeddings, suitable for biomedical text comparisons.
- `machine_translation_metrics`: Evaluates using machine translation metrics such as BLEU (1-4), ROUGE, CIDEr, and METEOR, commonly used in translation quality assessment.
- `bert_score_metric`: Computes BERTScore, which uses contextual embeddings for a more semantic similarity measure.
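To illustrate the keyword-based variant, the dictionary Jaccard Index can be sketched as below. This is a simplified sketch, assuming keywords are matched case-insensitively by substring against a flat keyword set; the actual implementation may tokenize and match differently:

```python
def jaccard_index_dictionary(original: str, generated: str,
                             keywords: set[str]) -> float:
    """Jaccard Index over the dictionary keywords found in each text."""
    orig_kw = {kw for kw in keywords if kw in original.lower()}
    gen_kw = {kw for kw in keywords if kw in generated.lower()}
    if not orig_kw and not gen_kw:
        return 1.0  # neither text contains any keyword: treat as identical
    # |intersection| / |union| of the keyword sets
    return len(orig_kw & gen_kw) / len(orig_kw | gen_kw)
```

A score of 1.0 means both texts mention exactly the same dictionary keywords; 0.0 means they share none.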
To run the evaluation with all metrics and the default dictionary:

```
python src/evaluate.py path_to_input/input.csv path_to_output
```

To specify specific metrics and use a custom dictionary:

```
python src/evaluate.py path/to/input.json path/to/output_dir/ --dict-path path/to/dictionary.json --metrics bert_score_metric --metrics machine_translation_metrics
```

Two JSON output files are generated: `results.json`, which contains the results of each metric computed on each input text pair, and `results_averaged.json`, which contains the results averaged across all report pairs for each metric. For example, with the default configuration, `results_averaged.json` will look like this:
```json
{
    "jaccard_index_dictionary_mean": "score",
    "jaccard_index_scispacy_mean": "score",
    "cosine_similarity_biobert_mean": "score",
    "BLEU-1_mean": "score",
    "BLEU-2_mean": "score",
    "BLEU-3_mean": "score",
    "BLEU-4_mean": "score",
    "Cider_mean": "score",
    "ROUGE-L_mean": "score",
    "Meteor_mean": "score",
    "Bertscore_p_mean": "score",
    "Bertscore_r_mean": "score",
    "Bertscore_f1_mean": "score"
}
```
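The averaged file can be derived from the per-pair results. A minimal sketch, assuming `results.json` maps each metric name to a list of per-pair scores (the tool's actual layout may differ, and `average_results` is a hypothetical helper name):

```python
import json

def average_results(results_path: str, out_path: str) -> dict:
    """Average per-pair scores into one mean per metric, keyed '<metric>_mean'."""
    with open(results_path) as f:
        results = json.load(f)  # e.g. {"BLEU-1": [0.4, 0.6], ...}
    averaged = {f"{metric}_mean": sum(scores) / len(scores)
                for metric, scores in results.items()}
    with open(out_path, "w") as f:
        json.dump(averaged, f, indent=2)
    return averaged
```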
Steps to Install Locally:

1. Install Conda (e.g., Miniconda or Anaconda).
2. Create the Environment:

```
conda env create -f environment.yml
```

3. Activate the Environment:

```
conda activate translator
```

4. Run the Script:

```
python src/evaluate.py path_to_input/input.csv path_to_output_dir/
```

To build the Docker image, navigate to the directory containing the Dockerfile and run:

```
docker build -t evaluator:latest .
```

Example SLURM batch script for running the containerized evaluator:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/path_to_data_dir/:/data_dir/
#SBATCH --container-image="tag"

# Run the Python script with arguments
python src/evaluate.py /data_dir/data.csv /data_dir/output_dir/
```

This project is licensed under the MIT License.