LeJudith/eval_llm

Evaluation Metrics for Large Language Models (LLMs)

This tool evaluates the performance of Large Language Models (LLMs) using various metrics to quantify the similarity and quality of generated texts compared to reference texts. It supports multiple metrics, including Jaccard Index, cosine similarity, BERTScore, and machine translation metrics, tailored to compare biomedical or domain-specific texts.

Features

  • Calculate a range of evaluation metrics, including keyword-based, similarity-based, and model-based scoring methods.
  • Use a custom keyword dictionary to adjust keyword sensitivity.
  • Input files can be in CSV or JSON format.
  • Saves metric results as JSON files in a user-defined output directory.

Usage

Command Line Arguments

  • input_file (required): Path to the input file (CSV or JSON) containing the texts to evaluate.
  • output_dir (required): Path to the output directory where the computed metrics will be stored.
  • dict_path (optional): Path to a custom JSON keyword dictionary for keyword-based metrics. Defaults to the PALGA thesaurus.
  • metrics (optional): List of metrics to compute. Available options: jaccard_index_dictionary, jaccard_index_scispacy, cosine_similarity_biobert, machine_translation_metrics, bert_score_metric. Defaults to all metrics if not specified.

Input File Format

The input file should contain two columns labeled original and generated. The original column contains reference texts, and the generated column contains the corresponding texts generated by the LLM. The input file can be in CSV or JSON format.

Example CSV Format

original,generated
original_text1,generated_text1
original_text2,generated_text2

Example JSON Format

{
  "original": ["original_text1", "original_text2"],
  "generated": ["generated_text1", "generated_text2"]
}
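An input file in either format can be prepared programmatically. The following is a minimal sketch using only the Python standard library; the file names input.csv and input.json are illustrative, while the column names original and generated match the formats above:

```python
import csv
import json

# Paired reference and model-generated texts to evaluate
pairs = [
    ("original_text1", "generated_text1"),
    ("original_text2", "generated_text2"),
]

# Write the CSV variant: one header row, one row per text pair
with open("input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["original", "generated"])
    writer.writerows(pairs)

# Write the JSON variant: two parallel lists keyed by column name
originals, generated = zip(*pairs)
with open("input.json", "w", encoding="utf-8") as f:
    json.dump({"original": list(originals), "generated": list(generated)}, f, indent=2)
```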

Available Metrics

  • jaccard_index_dictionary: Calculates the Jaccard Index based on a custom dictionary of keywords.
  • jaccard_index_scispacy: Uses the SciSpacy model to calculate the Jaccard Index, focusing on biomedical entities.
  • cosine_similarity_biobert: Measures cosine similarity using BioBERT embeddings, suitable for biomedical text comparisons.
  • machine_translation_metrics: Evaluates the texts using machine translation metrics such as BLEU (1-4), ROUGE, CIDEr, and METEOR, which are commonly used in translation quality assessment.
  • bert_score_metric: Computes BERTScore, which uses contextual embeddings for a more semantic similarity measure.
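For intuition, the Jaccard Index behind the two jaccard_* metrics is the size of the intersection of two keyword sets divided by the size of their union. The sketch below uses plain whitespace tokenization for illustration; the actual tool extracts keywords with a dictionary or SciSpacy instead:

```python
def jaccard_index(reference: str, candidate: str) -> float:
    """Jaccard similarity between the word sets of two texts."""
    ref_set = set(reference.lower().split())
    cand_set = set(candidate.lower().split())
    if not ref_set and not cand_set:
        return 1.0  # two empty texts are trivially identical
    return len(ref_set & cand_set) / len(ref_set | cand_set)

# Identical texts score 1.0; texts with no words in common score 0.0
print(jaccard_index("chronic inflammation present", "chronic inflammation present"))  # 1.0
print(jaccard_index("benign lesion", "malignant tumor"))  # 0.0
```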

Example Command

To run the evaluation with all metrics and the default dictionary:

python src/evaluate.py  path_to_input/input.csv  path_to_output

To specify specific metrics and use a custom dictionary:

python src/evaluate.py  path/to/input.json  path/to/output_dir/ --dict-path path/to/dictionary.json  --metrics bert_score_metric --metrics machine_translation_metrics

Output

Two JSON output files are generated: results.json, which contains the result of each computed metric for each input text pair, and results_averaged.json, which contains the results averaged across all report pairs for each metric. For example, with the default configuration, results_averaged.json will look like this:

{
  "jaccard_index_dictionary_mean": "score",
  "jaccard_index_scispacy_mean": "score",
  "cosine_similarity_biobert_mean": "score",
  "BLEU-1_mean": "score",
  "BLEU-2_mean": "score",
  "BLEU-3_mean": "score",
  "BLEU-4_mean": "score",
  "Cider_mean": "score",
  "ROUGE-L_mean": "score",
  "Meteor_mean": "score",
  "Bertscore_p_mean": "score",
  "Bertscore_r_mean": "score",
  "Bertscore_f1_mean": "score"
}
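The averaged file can then be consumed downstream with any JSON reader. A minimal sketch follows; the sample metric names mirror the structure above, but the scores and the file location are illustrative stand-ins, not real output:

```python
import json
from pathlib import Path

# For illustration, write a small results_averaged.json like the one above
sample = {"jaccard_index_dictionary_mean": 0.71, "BLEU-1_mean": 0.42}
Path("results_averaged.json").write_text(json.dumps(sample), encoding="utf-8")

# Load the averaged metrics and print one line per metric
averaged = json.loads(Path("results_averaged.json").read_text(encoding="utf-8"))
for metric, score in averaged.items():
    print(f"{metric}: {score:.2f}")
```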

Running Locally or on ARM

Install Conda (e.g., Miniconda or Anaconda), then follow the steps below.

Create the Environment:

conda env create -f environment.yml

Activate the Environment:

conda activate translator

Run the Script:

python src/evaluate.py  path_to_input/input.csv  path_to_output_dir/

Building Docker Image

Navigate to the directory containing the Dockerfile, then run:

docker build -t evaluator:latest .

Running the script with SLURM

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/path_to_data_dir/:/data_dir/
#SBATCH --container-image="tag"

# Run the Python script with arguments
python src/evaluate.py /data_dir/data.csv /data_dir/output_dir/ 

License

This project is licensed under the MIT License.
