glee4810/SCARE


SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

Overview

SCARE is a comprehensive benchmark designed to evaluate post-hoc safety layers for Electronic Health Record (EHR) question answering systems. The benchmark addresses the critical need for reliable verification and correction of SQL queries generated by text-to-SQL models.

Abstract

Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging. Incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision making and jeopardize patient care.

While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is no standardized benchmark for evaluating independent post-hoc verification mechanisms, which are crucial for safe deployment.

To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as post-hoc safety layers in EHR QA systems. SCARE evaluates the joint task of:

  1. Question Answerability Classification: Determining whether a question is answerable, ambiguous, or unanswerable
  2. SQL Query Verification and Correction: Verifying or correcting candidate SQL queries

Dataset

The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in three publicly available EHR databases:

  • MIMIC-III: Medical Information Mart for Intensive Care III
  • MIMIC-IV: Medical Information Mart for Intensive Care IV
  • eICU: eICU Collaborative Research Database

The dataset covers a diverse set of questions (answerable, ambiguous, unanswerable) and corresponding candidate SQL queries generated by seven different text-to-SQL models.
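Each entry in the benchmark pairs a question and a candidate SQL query with an answerability category. As a rough illustration of working with such triples, here is a minimal sketch; the field names (`question`, `candidate_sql`, `label`) are assumptions for illustration and may differ from the actual keys in the `correction-data/*.json` files.

```python
import json
from collections import Counter

# Hypothetical triples mirroring the benchmark's three question
# categories; real entries come from correction-data/*.json.
sample = [
    {"question": "How many patients were admitted in 2105?",
     "candidate_sql": "SELECT count(*) FROM admissions WHERE admityear = 2105",
     "label": "answerable"},
    {"question": "What is the best treatment for patient 10021?",
     "candidate_sql": None,
     "label": "unanswerable"},
    {"question": "Show the recent lab results.",
     "candidate_sql": "SELECT * FROM labevents",
     "label": "ambiguous"},
]

def summarize(triples):
    """Tally how many triples fall into each answerability category."""
    return Counter(t["label"] for t in triples)

print(summarize(sample))
```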

Repository Structure

├── correction-data/          # Final question–SQL–answer tuples
│   ├── eicu_test_set_correction_data.json
│   ├── mimic_iv_test_set_correction_data.json
│   └── mimicsql_test_set_correction_data.json
├── data_final/               # Original question datasets
│   ├── *_test.json
│   ├── *_ans_test.json
│   ├── *_unans_test.json
│   └── *_ambig_test.json
├── databases/                # Database files and schemas (download required)
├── prompt/                   # Prompts for baseline approaches
│   ├── multi_turn_correction/
│   ├── single_turn_correction/
│   ├── two_stage_pipeline/
│   └── verifier_correction/
├── scripts/                  # Evaluation scripts
├── src/                      # Source code
│   ├── evaluate_sql_correction.py
│   ├── multi_turn_correction.py
│   ├── multi_turn_correction_cls.py
│   ├── single_turn_correction.py
│   ├── single_turn_correction_cls.py
│   ├── single_turn_two_stage.py
│   ├── verifier_correction.py
│   ├── verifier_correction_cls.py
│   └── utils/
├── output/                   # Model outputs
├── output_eval/              # Evaluation results
└── requirements.txt          # Python dependencies

Setup

Database Setup

The database files are not included in this repository due to their size. You need to download them separately:

  1. Download the database files (databases.zip) from Google Drive
  2. Unzip the downloaded file in the project root directory:
unzip databases.zip

This will create the databases/ directory containing the EHR database files (eICU, MIMIC-IV, MIMIC-SQL) required for evaluation.

How to Evaluate Models

Running Baseline Methods

To evaluate a model on the SCARE benchmark, run a correction method on each of the three datasets. Here's an example using the single_turn_two_stage approach with Gemini-2.5-Flash:

export model='gemini/gemini-2.5-flash'

# MIMIC-SQL (MIMIC-III)
export db_id='mimicsql'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# MIMIC-IV
export db_id='mimic_iv'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# eICU
export db_id='eicu'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30
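
Since the three invocations above differ only in the db_id value, they can be collapsed into a single loop (same flags as above, assuming a POSIX shell):

```shell
export model='gemini/gemini-2.5-flash'

# Run the single_turn_two_stage corrector on all three databases
for db_id in mimicsql mimic_iv eicu; do
    python src/single_turn_two_stage.py \
        --db_id "${db_id}" \
        --model_name "${model}" \
        --num_process 30
done
```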

For additional examples with different models (GPT, Gemini, Llama, Qwen) and correction methods, see the scripts/ directory.

Generating Evaluation Results

After running the correction method on all datasets, generate evaluation metrics:

dataset='mimic_iv,eicu,mimicsql'
python src/evaluate_sql_correction.py \
    --model_name "gemini-2.5-flash" \
    --dataset_name_list ${dataset} \
    --corrector_name single_turn_two_stage
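
The internals of evaluate_sql_correction.py are not shown here, but the standard metric for this kind of task is execution accuracy: run the predicted and gold SQL against the database and compare result sets. A minimal sketch of that idea, using sqlite3; the function name and normalization are illustrative assumptions, not the benchmark's actual implementation.

```python
import sqlite3

def execution_match(db_path, pred_sql, gold_sql):
    """Return True if two queries yield the same result set.

    Sketch of execution accuracy: queries that fail to execute
    count as non-matching, and rows are compared as multisets so
    row order does not matter. The benchmark's real evaluation
    script may normalize results differently.
    """
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(pred_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid query cannot match the gold answer
    finally:
        conn.close()
    return sorted(map(repr, pred)) == sorted(map(repr, gold))
```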
