This repository introduces ofhsim, an R package that creates synthetic datasets mimicking the structure of multiple linked health data sources (participant, questionnaire, clinic measures, HES-like events, primary care medicines, deaths, and geography) available in the Our Future Health (OFH) cohort. It is designed so that researchers can develop and test workflows before running them on the much larger real data.
Synthetic data are generated from rule-based and stochastic generators, schema dictionaries, and user-supplied code lists, not from participant-level OFH source records.
The output only approximates the schema of the real data. It is intended for code development, not for statistical inference.
This is an unofficial tool and is not affiliated with Our Future Health.
The generator produces CSV files for a synthetic cohort, including:
- participant data
- questionnaire data
- clinic measurements
- HES-like outpatient, inpatient, emergency, and deaths data
- primary care medication data
- country/region data
All datasets are linked by `pid`.
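Because every table shares the `pid` key, a quick linkage check is a useful first test of any workflow. The sketch below uses small hand-made data frames rather than ofhsim output. The table names, the non-key columns, and the helper `check_pid_linkage()` are hypothetical and chosen only for illustration.

```r
# Toy stand-ins for two linked tables (hypothetical columns; not ofhsim output).
participants <- data.frame(
  pid = 1:5,
  birth_year = c(1960L, 1975L, 1982L, 1990L, 1955L)
)
outpatient <- data.frame(
  pid = c(2L, 2L, 4L, 5L),
  diag = c("I210", "I500", "I210", "N189")
)

# Hypothetical helper: TRUE when every key value in `child` exists in `parent`.
check_pid_linkage <- function(parent, child, key = "pid") {
  all(child[[key]] %in% parent[[key]])
}

linked_ok <- check_pid_linkage(participants, outpatient)

# Join event rows back to participant attributes using the shared key.
joined <- merge(participants, outpatient, by = "pid")
```

The same pattern should apply to any pair of generated tables once they are loaded, provided each contains a `pid` column.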
The sample size, diagnostic codes, and BNF drug codes are determined by the user. These codes are then sampled and distributed using configurable probabilities (for cohort coverage, record counts, and field-level missingness) across the inpatient, outpatient, A&E, and primary care medicines datasets.
Minimum:
- R 4.2+
For vignette building/preview:
- rmarkdown
- knitr
Install vignette dependencies (if needed):
install.packages(c("rmarkdown", "knitr"))
Clone the repository (shell):
git clone https://github.com/hlnicholls/OFHSimulatedData
cd ./OFHSimulatedData
Install from source and run the simulator (R):
install.packages(".", repos = NULL, type = "source")
library(ofhsim)
sim <- OFHCohortSimulator$new(project_root = ".", seed = 123)
out <- sim$run_all(n = 1000)
You control which codes appear in the outputs and how frequently records are generated.
- Set code pools with `icd10`, `opcs4`, and `bnf_codes` (or `*_file` inputs).
- Set dataset-level coverage and record volume with `proportions` and `record_multipliers`.
  - E.g., setting `proportions = list(nhse_outpat = 0.25)` and `record_multipliers = list(nhse_outpat = 1.2)` for `n = 1000` gives about 250 unique outpatient participants and about 1,200 outpatient rows (including repeat visits).
- Set within-dataset code proportionality (relative code frequencies) with `code_config` weights, for example `icd10_weights`, `opcs4_weights`, `primary_icd10_weights`, `underlying_icd10_weights`, `ae_specific_weights`, and `read_weights`.
  - E.g., setting `code_config = list(nhse_outpat_data = list(icd10_weights = c(I210 = 5, I500 = 1)))` gives roughly a 5:1 ratio of `I210` to `I500` among sampled outpatient ICD-10 codes (subject to normal random variation).
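The coverage, volume, and weighting arithmetic described above can be sketched in base R. This is an illustration under stated assumptions, not ofhsim's internal implementation: it assumes the row count is `n * record_multiplier` (consistent with the 1,200-row example), that each covered participant receives at least one row, and that weights are normalised into sampling probabilities. All variable names are hypothetical.

```r
set.seed(123)

n <- 1000                  # cohort size
proportion <- 0.25         # share of the cohort with outpatient records
record_multiplier <- 1.2   # assumed to scale the row count relative to n

n_participants <- round(n * proportion)      # 250
n_rows <- round(n * record_multiplier)       # 1200

# Choose which participants appear in the dataset.
covered_pids <- sample(seq_len(n), n_participants)

# One row per covered participant, then fill the rest with repeat visits.
extra_rows <- n_rows - n_participants
row_pids <- c(covered_pids, sample(covered_pids, extra_rows, replace = TRUE))

# Relative code weights, normalised to probabilities (5:1 here).
weights <- c(I210 = 5, I500 = 1)
probs <- weights / sum(weights)
codes <- sample(names(weights), n_rows, replace = TRUE, prob = probs)

outpat <- data.frame(pid = row_pids, diag = codes)

# Observed code shares; expect about 5/6 for I210, with random variation.
code_share <- prop.table(table(outpat$diag))
```

With a small `n`, the observed ratio can drift noticeably from 5:1, which is the "normal random variation" noted above.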
ICD-10 code files should use this structure (CSV columns, or tab-separated TXT with the same columns):
| code | description |
|---|---|
| I210 | STEMI of anterolateral wall |
| I500 | Congestive heart failure |
| N189 | Chronic kidney disease unspecified |
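As a minimal sketch, a code list with this layout can be written and read back in both formats using base R. The example assumes the files carry a header row naming the `code` and `description` columns; the temporary file paths and variable names are illustrative only.

```r
# Example code list matching the code / description layout.
icd10_list <- data.frame(
  code = c("I210", "I500", "N189"),
  description = c(
    "STEMI of anterolateral wall",
    "Congestive heart failure",
    "Chronic kidney disease unspecified"
  ),
  stringsAsFactors = FALSE
)

# Write the same list as CSV and as tab-separated TXT (temporary files).
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
write.csv(icd10_list, csv_path, row.names = FALSE)
write.table(icd10_list, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both back, keeping codes as character strings.
from_csv <- read.csv(csv_path, colClasses = "character")
from_txt <- read.delim(txt_path, colClasses = "character", quote = "")
```

Reading codes as character avoids accidental numeric conversion of code values that happen to look like numbers.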
OPCS-4 code files should use this structure (CSV columns, or tab-separated TXT with the same columns):
| code | description |
|---|---|
| K401 | Percutaneous transluminal balloon angioplasty of coronary artery |
| K451 | Insertion of drug-eluting stent into coronary artery |
| K561 | Repair of heart valve |
BNF codes should use this structure:
| BNFCode | BNFName | Formulation | Strength |
|---|---|---|---|
| ZZZ0001AA | Custom Test Drug A | tablets | 10 mg |
| ZZZ0002BB | Custom Test Drug B | inhaler | 100 mcg |
| ZZZ0003CC | Custom Test Drug C | capsules | 50 mg |
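Before passing a medication table to the generator, it may help to confirm that the expected columns are present. The check below is a hypothetical pre-flight sketch in base R, not part of ofhsim. It treats `BNFCode`, `BNFName`, and `Formulation` as required and `Strength` as optional, and the variable names are illustrative.

```r
# Columns assumed required and optional for a BNF medication table.
required_cols <- c("BNFCode", "BNFName", "Formulation")
optional_cols <- "Strength"

bnf <- data.frame(
  BNFCode = c("ZZZ0001AA", "ZZZ0002BB", "ZZZ0003CC"),
  BNFName = c("Custom Test Drug A", "Custom Test Drug B", "Custom Test Drug C"),
  Formulation = c("tablets", "inhaler", "capsules"),
  Strength = c("10 mg", "100 mcg", "50 mg"),
  stringsAsFactors = FALSE
)

# Report any required columns that are absent and any repeated codes.
missing_cols <- setdiff(required_cols, names(bnf))
has_duplicate_codes <- anyDuplicated(bnf$BNFCode) > 0
has_strength <- all(optional_cols %in% names(bnf))

# Save as CSV so the table can be supplied as a file.
bnf_path <- tempfile(fileext = ".csv")
write.csv(bnf, bnf_path, row.names = FALSE)
```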
library(ofhsim)
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
# or load code lists from files
# - ICD10/OPCS4: include both code and description
# - TXT: tab-separated columns (code\tdescription)
# - CSV: 'code,description' columns
# - BNF: use a CSV with columns BNFCode, BNFName, Formulation (optional Strength)
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.csv",
  bnf_codes_file = "bnf_medications.csv"
)
# return objects only in the R environment (no CSV files written)
out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)
# write CSV files only (no returned R objects)
generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = TRUE,
  return_objects = FALSE,
  output_dir = "example"
)
# customize generation probabilities
out_custom_probs <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  ),
  save_csv = FALSE,
  return_objects = TRUE
)
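A field-level setting such as `diag_4_02_missing_prob = 0.70` can be read as a per-row probability that the field is left blank. The base R sketch below shows that idea on a toy column. It is an assumption about the semantics, not ofhsim's actual code, and the variable names are illustrative.

```r
set.seed(123)

n_rows <- 2000
missing_prob <- 0.70   # probability that a row's secondary diagnosis is blank

# Toy secondary-diagnosis values before missingness is applied.
diag_values <- sample(c("I210", "I500", "N189"), n_rows, replace = TRUE)

# Blank each value independently with probability `missing_prob`.
is_missing <- runif(n_rows) < missing_prob
diag_4_02 <- ifelse(is_missing, NA_character_, diag_values)

# Observed missing share; expect a value near 0.70, with random variation.
observed_missing <- mean(is.na(diag_4_02))
```

Checking the observed missing share against the configured probability is a simple way to confirm that a custom setting behaved as intended.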