SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering
SCARE is a comprehensive benchmark designed to evaluate post-hoc safety layers for Electronic Health Record (EHR) question answering systems. The benchmark addresses the critical need for reliable verification and correction of SQL queries generated by text-to-SQL models.
Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging. Incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision making and jeopardize patient care.
While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is no standardized benchmark for evaluating independent post-hoc verification mechanisms, which are crucial for safe deployment.
To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as post-hoc safety layers in EHR QA systems. SCARE evaluates the joint task of:
- Question Answerability Classification: Determining whether a question is answerable, ambiguous, or unanswerable
- SQL Query Verification and Correction: Verifying or correcting candidate SQL queries
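Concretely, a post-hoc safety layer consumes a (question, candidate SQL) pair and emits both an answerability label and a final query. A minimal sketch of that interface (the class and field names here are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

LABELS = {"answerable", "ambiguous", "unanswerable"}

@dataclass
class SafetyLayerOutput:
    """Joint output of a post-hoc safety layer (illustrative schema)."""
    label: str                 # one of LABELS
    final_sql: Optional[str]   # verified/corrected SQL, or None if abstaining

def make_output(label: str, sql: Optional[str]) -> SafetyLayerOutput:
    # Only answerable questions should carry an executable query;
    # ambiguous or unanswerable ones should abstain.
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    if label != "answerable" and sql is not None:
        raise ValueError("non-answerable questions must abstain from SQL")
    return SafetyLayerOutput(label, sql)
```

The key design point is that both sub-tasks are judged jointly: a corrected query is only useful if the question was classified as answerable in the first place.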
The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in three publicly available EHR databases:
- MIMIC-III: Medical Information Mart for Intensive Care III
- MIMIC-IV: Medical Information Mart for Intensive Care IV
- eICU: eICU Collaborative Research Database
The dataset covers a diverse set of questions (answerable, ambiguous, unanswerable) and corresponding candidate SQL queries generated by seven different text-to-SQL models.
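The per-database files under `correction-data/` can be read with the standard `json` module. A minimal loading sketch, assuming each file holds a list of records (the `question_type` key below is hypothetical; inspect the files for the actual field names):

```python
import json
from collections import Counter

def load_triples(path):
    """Load question-SQL-answer records from a correction-data JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Example: tally records by a hypothetical answerability field.
# triples = load_triples("correction-data/mimic_iv_test_set_correction_data.json")
# print(Counter(t.get("question_type", "unknown") for t in triples))
```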
```
├── correction-data/              # Final question–SQL–answer tuples
│   ├── eicu_test_set_correction_data.json
│   ├── mimic_iv_test_set_correction_data.json
│   └── mimicsql_test_set_correction_data.json
├── data_final/                   # Original question datasets
│   ├── *_test.json
│   ├── *_ans_test.json
│   ├── *_unans_test.json
│   └── *_ambig_test.json
├── databases/                    # Database files and schemas (download required)
├── prompt/                       # Prompts for baseline approaches
│   ├── multi_turn_correction/
│   ├── single_turn_correction/
│   ├── two_stage_pipeline/
│   └── verifier_correction/
├── scripts/                      # Evaluation scripts
├── src/                          # Source code
│   ├── evaluate_sql_correction.py
│   ├── multi_turn_correction.py
│   ├── multi_turn_correction_cls.py
│   ├── single_turn_correction.py
│   ├── single_turn_correction_cls.py
│   ├── single_turn_two_stage.py
│   ├── verifier_correction.py
│   ├── verifier_correction_cls.py
│   └── utils/
├── output/                       # Model outputs
├── output_eval/                  # Evaluation results
└── requirements.txt              # Python dependencies
```
The database files are not included in this repository due to their size. You need to download them separately:
- Download the database files (`databases.zip`) from Google Drive
- Unzip the downloaded file in the project root directory:
```bash
unzip databases.zip
```

This will create the `databases/` directory containing the EHR database files (eICU, MIMIC-IV, MIMIC-SQL) required for evaluation.
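Assuming the unzipped databases are SQLite files (an assumption; check the file extensions in `databases/`), a quick sanity check that they open and contain tables:

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return the table names stored in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows]

# Walk the unzipped directory and report each database's table count:
# for db in Path("databases").rglob("*.sqlite"):
#     print(db, len(list_tables(db)), "tables")
```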
To evaluate a model on the SCARE benchmark, run the chosen correction method on each dataset. Here's an example using the single_turn_two_stage approach with Gemini-2.5-Flash:
```bash
export model='gemini/gemini-2.5-flash'

# MIMIC-SQL (MIMIC-III)
export db_id='mimicsql'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# MIMIC-IV
export db_id='mimic_iv'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30

# eICU
export db_id='eicu'
python src/single_turn_two_stage.py \
    --db_id ${db_id} \
    --model_name "${model}" \
    --num_process 30
```

For additional examples with different models (GPT, Gemini, Llama, Qwen) and correction methods, see the `scripts/` directory.
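The three per-database invocations follow the same pattern, so they can be driven by a loop. A dry-run sketch that prints each command before you commit to running it (drop the `echo` to actually execute):

```shell
export model='gemini/gemini-2.5-flash'
for db_id in mimicsql mimic_iv eicu; do
    echo python src/single_turn_two_stage.py \
        --db_id "${db_id}" \
        --model_name "${model}" \
        --num_process 30
done
```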
After running the correction method on all datasets, generate evaluation metrics:
```bash
dataset='mimic_iv,eicu,mimicsql'
python src/evaluate_sql_correction.py \
    --model_name "gemini-2.5-flash" \
    --dataset_name_list ${dataset} \
    --corrector_name single_turn_two_stage
```
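SQL correction is commonly scored by execution accuracy: a predicted query counts as correct when it returns the same rows as the gold query. A minimal sketch of that comparison against an in-memory SQLite database (the schema and queries are toy examples, not drawn from the benchmark, and this may differ from the evaluation script's exact metric):

```python
import sqlite3

def execution_match(db, pred_sql, gold_sql):
    """True if both queries return the same multiset of rows."""
    try:
        pred = sorted(db.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold = sorted(db.execute(gold_sql).fetchall())
    return pred == gold

# Toy database for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
db.executemany("INSERT INTO patients VALUES (?, ?)", [(1, 70), (2, 55)])
print(execution_match(db,
                      "SELECT id FROM patients WHERE age > 60",
                      "SELECT id FROM patients WHERE age >= 61"))  # → True
```

Sorting the fetched rows makes the comparison order-insensitive, which matters because semantically equivalent queries need not return rows in the same order.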