SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
SCARE is a comprehensive benchmark designed to evaluate post-hoc safety layers for Electronic Health Record (EHR) question answering systems. The benchmark addresses the critical need for reliable verification and correction of SQL queries generated by text-to-SQL models.
Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging. Incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision making and jeopardize patient care.
While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is no standardized benchmark for evaluating independent post-hoc verification mechanisms, which are crucial for safe deployment.
To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as post-hoc safety layers in EHR QA systems. SCARE evaluates the joint task of:
- Question Answerability Classification: Determining whether a question is answerable, ambiguous, or unanswerable
- SQL Query Verification and Correction: Verifying or correcting candidate SQL queries
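Concretely, a post-hoc safety layer consumes a (question, candidate SQL) pair and emits both an answerability label and a final query. A minimal sketch of that interface (the class and field names here are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

LABELS = {"answerable", "ambiguous", "unanswerable"}

@dataclass
class SafetyLayerOutput:
    """Joint output of a post-hoc safety layer (illustrative schema)."""
    label: str                 # one of LABELS
    final_sql: Optional[str]   # verified/corrected SQL, or None if abstaining

def make_output(label: str, sql: Optional[str]) -> SafetyLayerOutput:
    # Only answerable questions should carry an executable query;
    # ambiguous or unanswerable ones should abstain.
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    if label != "answerable" and sql is not None:
        raise ValueError("non-answerable questions must abstain from SQL")
    return SafetyLayerOutput(label, sql)
```

The key design point is that both sub-tasks are judged jointly: a corrected query is only useful if the question was classified as answerable in the first place.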
The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in three publicly available EHR databases:
- MIMIC-III: Medical Information Mart for Intensive Care III
- MIMIC-IV: Medical Information Mart for Intensive Care IV
- eICU: eICU Collaborative Research Database
The dataset covers a diverse set of questions (answerable, ambiguous, unanswerable) and corresponding candidate SQL queries generated by seven different text-to-SQL models.
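The per-database files under `correction-data/` can be read with the standard `json` module. A minimal loading sketch, assuming each file holds a list of records (the `question_type` key below is hypothetical; inspect the files for the actual field names):

```python
import json
from collections import Counter

def load_triples(path):
    """Load question-SQL-answer records from a correction-data JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Example: tally records by a hypothetical answerability field.
# triples = load_triples("correction-data/mimic_iv_test_set_correction_data.json")
# print(Counter(t.get("question_type", "unknown") for t in triples))
```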
```
├── correction-data/              # Final question–SQL–answer tuples
│   ├── eicu_test_set_correction_data.json
│   ├── mimic_iv_test_set_correction_data.json
│   └── mimicsql_test_set_correction_data.json
├── data_final/                   # Original question datasets
│   ├── *_test.json
│   ├── *_ans_test.json
│   ├── *_unans_test.json
│   └── *_ambig_test.json
├── databases/                    # Database files and schemas (download required)
├── prompt/                       # Prompts for baseline approaches
│   ├── multi_turn_correction/
│   ├── single_turn_correction/
│   ├── two_stage_pipeline/
│   └── verifier_correction/
├── scripts/                      # Evaluation scripts
├── src/                          # Source code
│   ├── evaluate_sql_correction.py
│   ├── multi_turn_correction.py
│   ├── multi_turn_correction_cls.py
│   ├── single_turn_correction.py
│   ├── single_turn_correction_cls.py
│   ├── single_turn_two_stage.py
│   ├── verifier_correction.py
│   ├── verifier_correction_cls.py
│   └── utils/
├── output/                       # Model outputs
├── output_eval/                  # Evaluation results
└── requirements.txt              # Python dependencies
```
The database files are not included in this repository due to their size. You need to download them separately:
- Download the database files (`databases.zip`) from Google Drive
- Unzip the downloaded file in the project root directory:
```bash
unzip databases.zip
```

This will create the `databases/` directory containing the EHR database files (eICU, MIMIC-IV, MIMIC-SQL) required for evaluation.
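Assuming the unzipped databases are SQLite files (an assumption; check the file extensions in `databases/`), a quick sanity check that they open and contain tables:

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return the table names stored in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows]

# Walk the unzipped directory and report each database's table count:
# for db in Path("databases").rglob("*.sqlite"):
#     print(db, len(list_tables(db)), "tables")
```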
To evaluate a model on the SCARE benchmark, run the chosen correction method on each dataset. Here's an example using the single_turn_two_stage approach with Gemini-2.5-Flash:
```bash
export model='gemini/gemini-2.5-flash'

# MIMIC-SQL (MIMIC-III)
export db_id='mimicsql'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# MIMIC-IV
export db_id='mimic_iv'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# eICU
export db_id='eicu'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30
```

For additional examples with different models (GPT, Gemini, Llama, Qwen) and correction methods, see the `scripts/` directory.
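The three per-database invocations follow the same pattern, so they can be driven by a loop. A dry-run sketch that prints each command before you commit to running it (drop the `echo` to actually execute):

```shell
export model='gemini/gemini-2.5-flash'
for db_id in mimicsql mimic_iv eicu; do
    echo python src/single_turn_two_stage.py \
        --db_id "${db_id}" \
        --model_name "${model}" \
        --num_process 30
done
```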
After running the correction method on all datasets, generate evaluation metrics:
```bash
dataset='mimic_iv,eicu,mimicsql'
python src/evaluate_sql_correction.py \
    --model_name "gemini-2.5-flash" \
    --dataset_name_list ${dataset} \
    --corrector_name single_turn_two_stage
```
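SQL correction is commonly scored by execution accuracy: a predicted query counts as correct when it returns the same rows as the gold query. A minimal sketch of that comparison against an in-memory SQLite database (the schema and queries are toy examples, not drawn from the benchmark, and this may differ from the evaluation script's exact metric):

```python
import sqlite3

def execution_match(db, pred_sql, gold_sql):
    """True if both queries return the same multiset of rows."""
    try:
        pred = sorted(db.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold = sorted(db.execute(gold_sql).fetchall())
    return pred == gold

# Toy database for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
db.executemany("INSERT INTO patients VALUES (?, ?)", [(1, 70), (2, 55)])
print(execution_match(db,
                      "SELECT id FROM patients WHERE age > 60",
                      "SELECT id FROM patients WHERE age >= 61"))  # → True
```

Sorting the fetched rows makes the comparison order-insensitive, which matters because semantically equivalent queries need not return rows in the same order.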