We present the SimSUM benchmark, a simulated dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. For more information about the generation of the data, we refer to the paper.
Should you decide to use SimSUM for your own research, please cite:
@article{rabaey2025simsum,
title={SimSUM--simulated benchmark with structured and unstructured medical records},
author={Rabaey, Paloma and Heytens, Stefan and Demeester, Thomas},
journal={Journal of Biomedical Semantics},
volume={16},
number={1},
pages={20},
year={2025},
publisher={Springer}
}
Data access
To access the SimSUM dataset, please download the csv file from this repository. The dataset contains the following variables (the corresponding column names are given in parentheses):
- diagnoses: pneumonia (pneu) and common cold (cold)
- symptoms: dyspnea (dysp), cough (cough), pain (pain), fever (fever), nasal (nasal)
- underlying conditions: asthma (asthma), smoking (smoking), COPD (COPD), hay fever (hay_fever)
- external influence (non-clinical): policy (policy), self-employed (self_empl), season (season)
- treatment: antibiotics (antibiotics)
- outcome: days at home (days_at_home)
- text note (text): clinical note describing the patient encounter
- compact text note (advanced_text): more compact (and therefore more difficult) version of the note in text
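As a minimal sketch of how the csv file might be loaded with pandas: the helper names load_simsum and split_labels_and_notes are hypothetical and not part of the repository, and the path you pass in depends on where you saved the download. The column names are the ones listed above.

```python
import pandas as pd

# Column names taken from the variable list above.
SYMPTOMS = ["dysp", "cough", "pain", "fever", "nasal"]
TEXT_COLUMNS = ["text", "advanced_text"]


def load_simsum(path):
    """Read the SimSUM csv and check that the expected columns are present.

    Hypothetical helper: `path` is wherever the downloaded csv was saved.
    """
    df = pd.read_csv(path)
    missing = [c for c in SYMPTOMS + TEXT_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


def split_labels_and_notes(df, compact=False):
    """Return (symptom labels, notes) for an information extraction setup."""
    note_column = "advanced_text" if compact else "text"
    return df[SYMPTOMS], df[note_column]
```

Calling split_labels_and_notes(df, compact=True) selects the compact notes instead of the normal ones, which is the harder extraction setting.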
Potential use
The SimSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text - the symptoms, in the case of SimSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. Importantly, SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting.
Data generating process
The figure below describes the full data generating process. First, the tabular portion of the synthetic patient record is sampled from a Bayesian network, where both the structure and the conditional probability distributions were defined by an expert. Afterwards, we construct a prompt containing information on the symptoms experienced by the patient, as well as their underlying health conditions (but no diagnoses). We ask the GPT-4o large language model to generate a fictional clinical note describing this patient encounter. Finally, we ask the model to generate a more challenging, compact version of the note, mimicking the complexity of real clinical notes by prompting the use of abbreviations and shortcuts. We generate 10,000 of these synthetic patient records in total.
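The first two steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three-variable structure (smoking, pneu, cough), the probabilities, and the function names are invented here and are not the expert-defined network, which is constructed in data_generating_process.py.

```python
import random

# Toy network, invented for illustration only: smoking -> pneu -> cough.
# These probabilities are made up; they are NOT the expert-defined values.
P_SMOKING = 0.2
P_PNEU_GIVEN_SMOKING = {True: 0.05, False: 0.01}
P_COUGH_GIVEN_PNEU = {True: 0.9, False: 0.1}


def sample_record(rng):
    """Ancestral sampling: draw each variable after its parents."""
    smoking = rng.random() < P_SMOKING
    pneu = rng.random() < P_PNEU_GIVEN_SMOKING[smoking]
    cough = rng.random() < P_COUGH_GIVEN_PNEU[pneu]
    return {"smoking": smoking, "pneu": pneu, "cough": cough}


def build_prompt_info(record):
    """Keep only what the note prompt may mention: no diagnoses."""
    return {k: v for k, v in record.items() if k != "pneu"}


rng = random.Random(0)  # fixed seed so the toy run is reproducible
records = [sample_record(rng) for _ in range(5)]
prompt_inputs = [build_prompt_info(r) for r in records]
```

The point of build_prompt_info is that the diagnosis stays in the tabular record but is withheld from the information passed to the text generation step.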
src folder:
- data_generating_process.py: Contains the RespiratoryData class, in which the expert-defined Bayesian network is constructed. This defines the data generating process from which our tabular patient records were sampled.
- text_generation.ipynb: Demonstrates how text prompts are created based on the information in the tabular patient record. These prompts were fed to GPT-4o to generate the clinical text notes in our dataset.
- symptom_predictor_baselines.ipynb: Demonstrates how we ran some simple symptom predictor baselines on the SimSUM dataset. There are two tabular baselines (BN-tab and XGBoost-tab), one text-only neural classifier (neural-text) and one neural classifier that also sees the tabular features at the input (neural-text-tab).
- expert_evaluation.ipynb: Presents the results of our expert evaluation, where five experts rated the notes on several aspects, including consistency with the prompt and realism of the added context.
- span_annotations.ipynb: Describes our automated pipeline for annotating the clinical notes with spans pinpointing in which exact phrase(s) each symptom is mentioned in the note.
utils folder:
- prompt_generation.py: Helper functions for generating the text prompts from the tabular patient records.
- bayesian_network.py: Helper functions for learning the Bayesian network parameters from the data, used for training the BN-tab baseline.
- neural_classifier.py: Helper functions for learning the neural classifier weights, used for training the neural-text and neural-text-tab baselines.
data folder:
- df_simsum.p: Pickled dataframe containing 10,000 patient records, including both tabular features and clinical text notes (the same as the csv file, but in Pandas format).
- df_prompts.p: Pickled dataframe containing an extended version of all patient records, including the prompts and the information that was used to construct these (mentions of symptoms, descriptors of symptoms, etc.).
- df_train.p and df_test.p: Train (8000 samples) and test (2000 samples) sets used to train and evaluate the baseline models in symptom_predictor_baselines.ipynb.
- spans folder: Span annotations of the specific phrases in which each symptom is mentioned in the notes. normal_span_annotations.json contains these annotations for the normal notes, while adv_span_annotations.json contains these for the compact advanced notes. Each of the 10,000 entries contains a list of extracted spans, where each span is made up of the following attributes: symptom ("symptom"), extracted phrase ("text") and start and end index for where this phrase can be found in the note ("start" and "end").
- emb folder: Train and test dataframes containing the precomputed embeddings of the history and physical examination portions of the text notes. These are used to train the neural-text and neural-text-tab baselines in symptom_predictor_baselines.ipynb.
- emb/span folder: Contains precomputed embeddings for the extracted spans. File {sympt}_{compl}_ann_emb.npy (e.g. "dysp_normal_ann_emb.npy") contains the average span embeddings (in order, from index 0 to 9999) for symptom "sympt" and note type "compl".
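The span format and the embedding file naming described above can be sketched as follows. This is an illustration under stated assumptions: it assumes "start" and "end" are character offsets with an exclusive end (Python slice semantics), which should be verified against the actual annotation files. The note, the spans, and both helper names are made up for the example.

```python
def mismatched_spans(note, spans):
    """Return the spans whose "text" differs from note[start:end].

    Assumes character offsets with an exclusive end index.
    """
    return [s for s in spans if note[s["start"]:s["end"]] != s["text"]]


def span_embedding_filename(sympt, compl):
    """Build a file name following the {sympt}_{compl}_ann_emb.npy pattern."""
    return f"{sympt}_{compl}_ann_emb.npy"


# Toy note and spans, invented for illustration (not taken from the dataset).
note = "Patient reports a dry cough and mild fever."
spans = [
    {"symptom": "cough", "text": "dry cough", "start": 18, "end": 27},
    {"symptom": "fever", "text": "mild fever", "start": 32, "end": 42},
]

bad = mismatched_spans(note, spans)  # empty if all offsets line up
```

A non-empty result from mismatched_spans on real entries would suggest that the offset convention assumed here (exclusive end, character-based) does not match the files.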
results folder: Contains dataframes with the results of the symptom predictor baseline models when evaluated over the test set, for all symptoms and various settings. For more information, see the notebook symptom_predictor_baselines.ipynb.
eval folder:
- notes/Evaluator_{i}.csv: CSV file containing the scores assigned by evaluator i in our expert evaluation. For more information, see the notebook expert_evaluation.ipynb.
- notes/Instructions_clinical_evaluation.pdf: File with instructions on how to conduct the clinical evaluation, which was carefully explained to each expert evaluator.
- spans/span_eval_dataset.txt: Random subset of 100 notes for which the automatically extracted symptom spans were manually evaluated.
- spans/span_eval_results.csv: Results of the manual evaluation of the extracted symptom spans.
