This tool evaluates the performance of Large Language Models (LLMs) using various metrics to quantify the similarity and quality of generated texts compared to reference texts. It supports multiple metrics, including Jaccard Index, cosine similarity, BERTScore, and machine translation metrics, tailored to compare biomedical or domain-specific texts.
- Calculates a range of evaluation metrics, including keyword-based, similarity-based, and model-based scoring methods.
- Supports a custom keyword dictionary to adjust keyword sensitivity.
- Accepts input files in CSV or JSON format.
- Saves metric results in a user-defined JSON output file.
| Argument | Description | Required |
|---|---|---|
| `input_file` | Path to the input file (CSV or JSON) containing the texts to evaluate. | Yes |
| `output_dir` | Path to the output directory where the computed metrics will be stored. | Yes |
| `dict_path` | Path to a custom JSON keyword dictionary for keyword-based metrics. Defaults to the PALGA thesaurus. | No |
| `metrics` | List of metrics to compute. Available options: `jaccard_index_dictionary`, `jaccard_index_scispacy`, `cosine_similarity_biobert`, `machine_translation_metrics`, `bert_score_metric`. Defaults to all metrics if not specified. | No |
The input file should contain two columns labeled original and generated. The original column contains reference texts, and the generated column contains the corresponding texts generated by the LLM. The input file can be in CSV or JSON format.
CSV:

```csv
original,generated
original_text1,generated_text1
original_text2,generated_text2
```

JSON:

```json
{
    "original": ["original_text1", "original_text2"],
    "generated": ["generated_text1", "generated_text2"]
}
```
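Both formats reduce to the same pair of parallel columns, so loading can be sketched as follows (a minimal sketch using only the standard library; the function name `load_pairs` is hypothetical and the tool's actual loader may differ):

```python
import csv
import json
from pathlib import Path

def load_pairs(path: str) -> list[tuple[str, str]]:
    """Load (original, generated) text pairs from a CSV or JSON input file."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith(".json"):
        data = json.loads(text)
        # JSON schema: {"original": [...], "generated": [...]}
        return list(zip(data["original"], data["generated"]))
    # CSV schema: header row "original,generated", one pair per row
    rows = csv.DictReader(text.splitlines())
    return [(row["original"], row["generated"]) for row in rows]
```

Either way, each returned tuple is one reference/generation pair to be scored by the selected metrics.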
The following metrics are available:

- `jaccard_index_dictionary`: Calculates the Jaccard Index based on a custom dictionary of keywords.
- `jaccard_index_scispacy`: Uses the SciSpacy model to calculate the Jaccard Index, focusing on biomedical entities.
- `cosine_similarity_biobert`: Measures cosine similarity using BioBERT embeddings, suitable for biomedical text comparisons.
- `machine_translation_metrics`: Evaluates using machine translation metrics such as BLEU (1-4), ROUGE, CIDEr, and METEOR, commonly used in translation quality assessment.
- `bert_score_metric`: Computes BERTScore, which uses contextual embeddings for a more semantic similarity measure.
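To illustrate the keyword-based variant, the dictionary Jaccard Index can be sketched as below. This is a simplified sketch, assuming keywords are matched case-insensitively by substring against a flat keyword set; the actual implementation may tokenize and match differently:

```python
def jaccard_index_dictionary(original: str, generated: str,
                             keywords: set[str]) -> float:
    """Jaccard Index over the dictionary keywords found in each text."""
    orig_kw = {kw for kw in keywords if kw in original.lower()}
    gen_kw = {kw for kw in keywords if kw in generated.lower()}
    if not orig_kw and not gen_kw:
        return 1.0  # neither text contains any keyword: treat as identical
    # |intersection| / |union| of the keyword sets
    return len(orig_kw & gen_kw) / len(orig_kw | gen_kw)
```

A score of 1.0 means both texts mention exactly the same dictionary keywords; 0.0 means they share none.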
To run the evaluation with all metrics and the default dictionary:

```
python src/evaluate.py path_to_input/input.csv path_to_output
```

To specify specific metrics and use a custom dictionary:

```
python src/evaluate.py path/to/input.json path/to/output_dir/ --dict-path path/to/dictionary.json --metrics bert_score_metric --metrics machine_translation_metrics
```

Two JSON output files are generated: `results.json`, which contains the results of each metric computed on each input text pair, and `results_averaged.json`, which contains the results averaged across all report pairs for each metric. For example, with the default configuration, `results_averaged.json` will look like this:
```json
{
    "jaccard_index_dictionary_mean": "score",
    "jaccard_index_scispacy_mean": "score",
    "cosine_similarity_biobert_mean": "score",
    "BLEU-1_mean": "score",
    "BLEU-2_mean": "score",
    "BLEU-3_mean": "score",
    "BLEU-4_mean": "score",
    "Cider_mean": "score",
    "ROUGE-L_mean": "score",
    "Meteor_mean": "score",
    "Bertscore_p_mean": "score",
    "Bertscore_r_mean": "score",
    "Bertscore_f1_mean": "score"
}
```
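The averaged file can be derived from the per-pair results. A minimal sketch, assuming `results.json` maps each metric name to a list of per-pair scores (the tool's actual layout may differ, and `average_results` is a hypothetical helper name):

```python
import json

def average_results(results_path: str, out_path: str) -> dict:
    """Average per-pair scores into one mean per metric, keyed '<metric>_mean'."""
    with open(results_path) as f:
        results = json.load(f)  # e.g. {"BLEU-1": [0.4, 0.6], ...}
    averaged = {f"{metric}_mean": sum(scores) / len(scores)
                for metric, scores in results.items()}
    with open(out_path, "w") as f:
        json.dump(averaged, f, indent=2)
    return averaged
```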
Steps to Install Locally:

1. Install Conda (e.g., Miniconda or Anaconda).
2. Create the Environment:

```
conda env create -f environment.yml
```

3. Activate the Environment:

```
conda activate translator
```

4. Run the Script:

```
python src/evaluate.py path_to_input/input.csv path_to_output_dir/
```

To build the Docker image, navigate to the directory containing the Dockerfile and run:

```
docker build -t evaluator:latest .
```

Example SLURM batch script for running the containerized evaluator:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/path_to_data_dir/:/data_dir/
#SBATCH --container-image="tag"

# Run the Python script with arguments
python src/evaluate.py /data_dir/data.csv /data_dir/output_dir/
```

This project is licensed under the MIT License.