An interactive platform for Clinical Practice Research Datalink (CPRD) Aurum data extraction, code list development, and cohort assembly.
Developed by Dr Milad Nazarzadeh, Nuffield Department of Women's & Reproductive Health, University of Oxford.
CPRD Extractor is an open-source, browser-based application that provides a complete workflow for researchers working with CPRD Aurum electronic health record data. It unifies clinical code list development, high-performance data extraction, linked dataset querying, and cohort assembly into a single interactive platform.
The tool is designed to run on institutional high-performance computing (HPC) clusters (tested on the Oxford BMRC environment), generic Linux servers, and Windows workstations. It includes a built-in mock data mode for testing and demonstration without requiring CPRD data access.
- Code List Development: A structured six-stage pipeline for building, validating, and auditing clinical code lists (SNOMED CT, ICD-10, Read, medcodeid).
- Drug Lookup: A curated library of 315 cardiovascular and cardiometabolic drugs across 18 therapeutic classes with automated CPRD Product Dictionary matching.
- Disease Library: Pre-built SNOMED CT and ICD-10 code sets for 50+ cardiovascular conditions.
- High-Performance Extraction: DuckDB-powered parallel extraction from CPRD Aurum primary care files (Observation, DrugIssue, Patient, Consultation, Problem, Referral, Staff, Practice).
- Linked Data Support: Extraction from HES Admitted Patient Care, HES Outpatient, HES A&E, ONS Mortality, and Index of Multiple Deprivation.
- Cohort Builder: Interactive inclusion/exclusion criteria with attrition flowcharts and cross-extraction patient linking.
- HPC Integration: Automatic Slurm job array script generation for cluster-scale extraction with one-command launch.
The home screen provides an overview of the data environment, configured paths, and available modules.
A six-stage pipeline guides the user from defining a clinical feature of interest, through synonym generation and code browser searches, to CPRD EMIS Dictionary matching with full audit trail and Excel export.
Search and browse 315 cardiovascular/cardiometabolic drugs by therapeutic class, generic name, or brand name. Matched product codes (prodcodeids) are returned for direct use in CPRD DrugIssue extraction.
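
As an illustration of how a drug name might be resolved to product codes, the sketch below matches search terms against a few rows shaped like the CPRD Aurum Product Dictionary. The column names (`ProdCodeId`, `ProductName`, `DrugSubstanceName`), the sample rows, and the helper `match_prodcodes` are assumptions made for this example and are not taken from the application's code.

```python
import csv
import io

# Hypothetical rows shaped like the CPRD Aurum Product Dictionary (tab-delimited).
# The prodcodeids below are made up for the example.
SAMPLE_DICTIONARY = (
    "ProdCodeId\tProductName\tDrugSubstanceName\n"
    "1001\tAtorvastatin 20mg tablets\tAtorvastatin calcium trihydrate\n"
    "1002\tLipitor 20mg tablets\tAtorvastatin calcium trihydrate\n"
    "1003\tRamipril 5mg capsules\tRamipril\n"
)


def match_prodcodes(dictionary_text, search_terms):
    """Return sorted prodcodeids whose product or substance name contains any term."""
    terms = [t.lower() for t in search_terms]
    reader = csv.DictReader(io.StringIO(dictionary_text), delimiter="\t")
    matched = set()
    for row in reader:
        haystack = f"{row['ProductName']} {row['DrugSubstanceName']}".lower()
        if any(term in haystack for term in terms):
            # Keep identifiers as strings so long numeric IDs are never rounded.
            matched.add(row["ProdCodeId"])
    return sorted(matched)


# A generic name and a brand name can both be used as search terms.
atorvastatin_codes = match_prodcodes(SAMPLE_DICTIONARY, ["atorvastatin", "lipitor"])
print(atorvastatin_codes)
```

The returned identifiers could then be supplied as the product code list for a DrugIssue extraction.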
Extract records from Observation, DrugIssue, Consultation, and other CPRD Aurum file types. Select conditions from the built-in disease library or enter custom SNOMED CT/medcode codes. Real-time progress tracking shows folder-level scanning status.
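
To make the extraction step more concrete, the following sketch builds the kind of DuckDB query that could select Observation rows for a code list across practice folders. It is an assumption-laden illustration: the folder layout (one sub-folder per practice), the file name pattern, the selected columns, and the helper `build_observation_query` are hypothetical, and the application's actual queries may differ.

```python
def build_observation_query(data_root, medcodeids):
    """Build a DuckDB SQL string selecting Observation rows for the given medcodeids."""
    # Accept only digit strings so nothing unexpected is interpolated into the SQL.
    codes = [str(c).strip() for c in medcodeids]
    if not codes:
        raise ValueError("At least one medcodeid is required")
    bad = [c for c in codes if not c.isdigit()]
    if bad:
        raise ValueError(f"Non-numeric medcodeid(s): {bad}")

    # Assumed layout: one sub-folder per practice holding tab-delimited Observation files.
    file_glob = f"{data_root.rstrip('/')}/*/*Observation*.txt"
    in_list = ", ".join(f"'{c}'" for c in codes)
    # all_varchar=true reads every column as text, so identifiers keep their exact digits.
    return (
        "SELECT patid, obsdate, medcodeid, value\n"
        f"FROM read_csv('{file_glob}', delim='\\t', header=true, all_varchar=true)\n"
        f"WHERE medcodeid IN ({in_list})"
    )


sql = build_observation_query("/path/to/cprd_aurum", ["123456789", "987654321"])
print(sql)
# With DuckDB installed, the query could be run with: duckdb.connect().execute(sql).fetchall()
```

Reading all files through a single glob lets DuckDB scan them in parallel, which is one plausible way to obtain the folder-level parallelism described above.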
Define inclusion and exclusion criteria interactively. The attrition flow visualises each filtering step with patient counts. Summary statistics (sex, age, follow-up, linkage eligibility) are computed automatically.
- Python 3.9 or later
- pip (Python package manager)

```bash
# Clone the repository
git clone https://github.com/miladnazarzadeh/CprdExtractor.git
cd CprdExtractor

# Install dependencies
pip install -r requirements.txt

# Launch the application
streamlit run app.py
```

The application opens in your default browser at http://localhost:8501.
On clusters where internet access is restricted, install dependencies from a login node or pre-built environment:

```bash
module load Python/3.11.3-GCCcore-12.3.0
pip install --user -r requirements.txt
```

To reach the interface from your local machine, forward a port over SSH and start the application in headless mode:

```bash
# On your local machine: forward port 8501 from the cluster
ssh -L 8501:localhost:8501 username@bmrc-server

# On the cluster
streamlit run app.py --server.port 8501 --server.headless true
```

Then open http://localhost:8501 in your local browser.
Use the sidebar to choose your environment:
| Mode | Description |
|---|---|
| Mock Data | Synthetic data for testing and demonstration (default) |
| Live - BMRC | Pre-configured paths for the Oxford BMRC cluster |
| Live - Any Server | Specify a custom CPRD data root directory |
| Windows | Windows-compatible mode with native path handling |
Navigate to Code List Development to build a clinical code list:
- Define the clinical feature of interest
- Generate synonyms and identify existing published code lists
- Search code browsers (SNOMED CT, ICD-10)
- Review and classify candidate codes
- Match against the CPRD EMIS Medical Dictionary to obtain medcodeids
- Export a clinician-review questionnaire with full audit trail
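
The dictionary-matching stage can be pictured as a lookup from reviewed SNOMED CT concepts to medcodeids. The sketch below is a minimal illustration under stated assumptions: the column names (`MedCodeId`, `Term`, `SnomedCTConceptId`) are assumed to mirror the EMIS Medical Dictionary, the medcodeids in the sample rows are made up, and `match_medcodeids` is a hypothetical helper rather than the application's own function.

```python
import csv
import io

# Hypothetical extract shaped like the EMIS Medical Dictionary (tab-delimited).
# The MedCodeId values are invented for this example.
SAMPLE_MEDICAL_DICTIONARY = (
    "MedCodeId\tTerm\tSnomedCTConceptId\n"
    "900000001\tAtrial fibrillation\t49436004\n"
    "900000002\tParoxysmal atrial fibrillation\t282825002\n"
    "900000003\tAF - atrial fibrillation\t49436004\n"
    "900000004\tEssential hypertension\t59621000\n"
)


def match_medcodeids(dictionary_text, snomed_concept_ids):
    """Return dictionary rows for the requested concepts, plus the concepts with no match."""
    wanted = {str(c) for c in snomed_concept_ids}
    reader = csv.DictReader(io.StringIO(dictionary_text), delimiter="\t")
    matches = [row for row in reader if row["SnomedCTConceptId"] in wanted]
    matched_concepts = {row["SnomedCTConceptId"] for row in matches}
    # Unmatched concepts are kept so the audit trail can record them explicitly.
    unmatched = sorted(wanted - matched_concepts)
    return matches, unmatched


matches, unmatched = match_medcodeids(
    SAMPLE_MEDICAL_DICTIONARY, ["49436004", "282825002", "5370000"]
)
for row in matches:
    print(row["MedCodeId"], row["Term"])
print("No dictionary entry for:", unmatched)
```

One SNOMED CT concept can map to several medcodeids, which is why the match returns rows rather than a one-to-one mapping.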
Navigate to CPRD Aurum Extraction or Linkage Extraction:
- Select a condition from the Disease Library (50+ cardiovascular conditions) or enter custom codes
- Optionally scope the extraction to patients from a previous extraction
- Click Extract and monitor real-time progress
- Results are displayed in-app and auto-saved to disk in Parquet or CSV format
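
Scoping an extraction to patients from a previous extraction amounts to filtering on patient identifiers. A minimal sketch, assuming rows are dictionaries with a `patid` key; the helper `scope_to_patients` and the sample values are hypothetical:

```python
def scope_to_patients(rows, previous_patids):
    """Keep only the rows whose patid appears in an earlier extraction."""
    # Compare as strings so "1001" and 1001 are treated as the same patient.
    allowed = {str(p) for p in previous_patids}
    return [row for row in rows if str(row["patid"]) in allowed]


# Patient IDs from an earlier (hypothetical) diagnosis extraction.
previous_patids = ["1001", "1003"]

# Rows from a new (hypothetical) DrugIssue extraction.
new_rows = [
    {"patid": "1001", "prodcodeid": "555"},
    {"patid": "1002", "prodcodeid": "555"},
    {"patid": "1003", "prodcodeid": "777"},
]

scoped = scope_to_patients(new_rows, previous_patids)
print([row["patid"] for row in scoped])
```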
Navigate to Cohort Builder to apply eligibility criteria:
- Set age ranges, registration requirements, and linkage eligibility filters
- View the attrition flowchart with patient counts at each step
- Export the final cohort for downstream analysis
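
The attrition logic behind these steps can be sketched as a sequence of filters, with the patient count recorded after each one. The example below is illustrative only: the field names (`yob`, `regstartdate`, `hes_eligible`), the index date, the thresholds, and the helper `build_attrition` are assumptions, not the application's actual implementation.

```python
from datetime import date


def build_attrition(patients, criteria):
    """Apply criteria in order; return the final cohort and the count after each step."""
    remaining = list(patients)
    steps = [("Starting population", len(remaining))]
    for label, keep in criteria:
        remaining = [p for p in remaining if keep(p)]
        steps.append((label, len(remaining)))
    return remaining, steps


INDEX_DATE = date(2015, 1, 1)  # hypothetical study index date

patients = [
    {"patid": "1", "yob": 1950, "regstartdate": date(2000, 5, 1), "hes_eligible": True},
    {"patid": "2", "yob": 2005, "regstartdate": date(2010, 3, 1), "hes_eligible": True},
    {"patid": "3", "yob": 1960, "regstartdate": date(2014, 9, 1), "hes_eligible": True},
    {"patid": "4", "yob": 1945, "regstartdate": date(1998, 1, 1), "hes_eligible": False},
]

# Each criterion is a label plus a function returning True for patients to keep.
criteria = [
    ("Aged 18 or over at index", lambda p: INDEX_DATE.year - p["yob"] >= 18),
    ("Registered for at least 365 days before index",
     lambda p: (INDEX_DATE - p["regstartdate"]).days >= 365),
    ("Eligible for HES linkage", lambda p: p["hes_eligible"]),
]

cohort, steps = build_attrition(patients, criteria)
for label, count in steps:
    print(f"{label}: {count}")
```

Because the criteria are applied in order, the counts in `steps` correspond directly to the boxes of an attrition flowchart.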

```text
┌────────────────────────────────────────────────────────────┐
│                  Streamlit Web Interface                   │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│ Code List│   Drug   │  Aurum   │ Linkage  │     Cohort     │
│   Dev    │  Lookup  │Extraction│Extraction│    Builder     │
├──────────┴──────────┴──────────┴──────────┴────────────────┤
│                      CPRDEngine Core                       │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ DuckDB SQL  │  │  Mock Data   │  │  Slurm CLI Mode   │  │
│  │ (parallel)  │  │  Generator   │  │   (task arrays)   │  │
│  └─────────────┘  └──────────────┘  └───────────────────┘  │
├────────────────────────────────────────────────────────────┤
│                   Data Layer (Read-Only)                   │
│  CPRD Aurum │ HES APC/OP/A&E │ ONS Mortality │ IMD │ EMIS  │
└────────────────────────────────────────────────────────────┘
```

| Component | Description |
|---|---|
| CPRDEngine | Central engine handling practice folder discovery, file scanning, DuckDB-based extraction, and mock data generation |
| Code List Development | Six-stage pipeline with EMIS Dictionary matching, expansion analysis, and clinical review questionnaire generation |
| Drug Code Library | 315 drugs with generic names, brand names, BNF codes, therapeutic classes, and CPRD Product Dictionary search terms |
| Disease Library | 50+ cardiovascular conditions with validated SNOMED CT and ICD-10 code sets |
| Slurm Integration | CLI mode for HPC job arrays with automatic script generation, shard-based parallelism, and merge pipeline |
| Module | Navigation Label | Description |
|---|---|---|
| Home | Home | Dashboard with data environment status and path verification |
| Code List Dev | Code List Development | Six-stage code list creation and EMIS Dictionary matching |
| Drug Lookup | Drug Lookup | Search 315 drugs by class, name, or BNF code |
| Quick Extract | Quick Extract (Newbie) | Simplified one-click extraction for new users |
| Demographics | Demographics | Sex, age, IMD, ethnicity, and registration period extraction |
| Aurum Extraction | CPRD Aurum Extraction | Primary care data extraction by SNOMED, medcode, or prodcode |
| Linkage Extraction | Linkage Extraction | HES APC, HES OP, HES A&E, ONS Death, and IMD extraction |
| Multi-Source | Multi-Source Search | Simultaneous search across all CPRD and linked datasets |
| Cohort Builder | Cohort Builder | Inclusion/exclusion criteria with attrition flow |
| Analytics | Analytics | Descriptive statistics, temporal trends, and visualisations |
| Definitions | Definitions | Reference glossary for CPRD-specific terminology |
| Configuration | Configuration | Path management, output settings, and SSH connection panel |
For large-scale extractions on HPC clusters, the application generates Slurm job array scripts:

```bash
# Generate Slurm scripts
python app.py --generate_slurm \
    --extract_type snomed \
    --codes "60573004,86466006,83916000" \
    --total_tasks 50 \
    --output_dir /path/to/output

# One-command launch (submits extraction array + merge job)
cd /path/to/output
bash cprd_snomed_launch.sh

# Or step-by-step: capture the array's job ID so the merge can depend on it
JOB_ID=$(sbatch --parsable cprd_snomed.sh)                # Submit array (50 parallel tasks)
sbatch --dependency=afterok:$JOB_ID cprd_snomed_merge.sh  # Merge shards
```

Each task processes a subset of practice folders in parallel. The merge job combines all Parquet shards into a single output file.
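
One simple way to divide practice folders among array tasks is round-robin assignment by task index, sketched below. This is an illustration under assumptions, not the generated scripts' actual logic: the helpers `folders_for_task` and `current_task_id` are hypothetical, the folder names are made up, and the array is assumed to be zero-based (for example `--array=0-49`).

```python
import os


def folders_for_task(practice_folders, task_id, total_tasks):
    """Return the practice folders assigned to one array task (round-robin)."""
    if not 0 <= task_id < total_tasks:
        raise ValueError("task_id must be in the range [0, total_tasks)")
    # Sort first so every task sees the same ordering and shards never overlap.
    return sorted(practice_folders)[task_id::total_tasks]


def current_task_id(default=0):
    """Read the array index that Slurm exports to each task as SLURM_ARRAY_TASK_ID."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


# Seven hypothetical practice folders split across three tasks.
folders = [f"practice_{i:03d}" for i in range(1, 8)]
total_tasks = 3
shards = [folders_for_task(folders, t, total_tasks) for t in range(total_tasks)]
for t, shard in enumerate(shards):
    print(t, shard)
```

Every folder lands in exactly one shard, so the per-task outputs can be concatenated by the merge step without de-duplication.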
| Format | Extension | Advantages |
|---|---|---|
| Parquet (default) | .parquet | ~5× smaller, ~10× faster to read, preserves data types |
| CSV | .csv | Universal compatibility, human-readable |
| Both | .parquet + .csv | Maximum flexibility |
All outputs can optionally include human-readable code descriptions merged from the EMIS Medical Dictionary.
The built-in disease library includes validated SNOMED CT and ICD-10 code sets for:
- Coronary Heart Disease: Stable angina, unstable angina, NSTEMI, STEMI, chronic coronary syndrome
- Valvular Heart Disease: Aortic stenosis/regurgitation, mitral stenosis/regurgitation/prolapse, tricuspid and pulmonary valve disease, rheumatic heart disease
- Arrhythmias: Atrial fibrillation/flutter, SVT, VT/VF, bradycardia, heart block, long QT, Brugada, WPW
- Cardiomyopathies: Dilated, hypertrophic, restrictive, ARVC, Takotsubo, peripartum, amyloid, sarcoid
- Vascular Disease: PAD, carotid disease, aortic aneurysm/dissection, DVT, PE, renal artery stenosis
- Congenital Heart Disease: ASD, VSD, coarctation, Tetralogy of Fallot, TGA, HLHS, PDA
- Heart Failure: All subtypes with 50+ SNOMED codes
- Infectious/Inflammatory: Endocarditis, myocarditis, pericarditis, Kawasaki, Chagas
315 cardiovascular and cardiometabolic drugs across 18 therapeutic classes, including:
Antiarrhythmics, Anticoagulants, Antihypertensives (ACE inhibitors, ARBs, CCBs, beta-blockers, diuretics, MRAs), Antiplatelets, Cardiac Amyloidosis agents, Critical Care & Vasoactive agents, Diabetes/Glucose-Lowering agents (SGLT2i, GLP-1 RA, DPP-4i, insulin), Heart Failure agents (ARNI, ivabradine, vericiguat), Lipid-Lowering agents (statins, PCSK9i, inclisiran, bempedoic acid), Nitrates & Antianginals, Obesity (CV-relevant), Pericarditis & Inflammatory, Peripheral Vascular Disease, Potassium Management, Pulmonary Hypertension, and Thrombolytics.
| Package | Minimum Version | Purpose |
|---|---|---|
| streamlit | 1.28.0 | Web interface |
| pandas | 1.5.0 | Data manipulation |
| numpy | 1.23.0 | Numerical operations |
| duckdb | 0.9.0 | High-performance SQL extraction |
| pyarrow | 12.0.0 | Parquet I/O |
| openpyxl | 3.1.0 | Excel export (code list questionnaires) |
| plotly | 5.15.0 | Interactive visualisations |
If you use CPRD Extractor in your research, please cite:

```bibtex
@software{nazarzadeh2026cprdextractor,
  author = {Nazarzadeh, Milad},
  title = {{CPRD Extractor: An Interactive Platform for Clinical Practice
            Research Datalink Data Extraction and Cohort Assembly}},
  year = {2026},
  url = {https://github.com/miladnazarzadeh/CprdExtractor},
  version = {1.0.0},
  institution = {Nuffield Department of Women's and Reproductive Health,
                 University of Oxford}
}
```

Contributions are welcome. Please read CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License.
Note: This software facilitates extraction from CPRD data. It does not distribute or contain any patient data. Users must hold a valid CPRD data licence and comply with all applicable data governance requirements.
This tool was developed as part of the HEART-MIND Programme at the University of Oxford, supported by the Nuffield Department of Women's & Reproductive Health.




