Adding an llms.txt to improve MCP usage #14

Open
zwelshman wants to merge 1 commit into main from adding-llms.txt

@zwelshman
Contributor

BHF Data Science Centre – Health Data Science Documentation

Documentation for the BHF Data Science Centre (HDR UK) Health Data Science team.
Covers NHS England Secure Data Environment (SDE) datasets, curated data assets,
curated phenotypes, and supporting resources for cardiovascular and population
health research using linked NHS administrative data.

The repository is used by researchers and analysts working within the
CVD-COVID-UK/COVID-IMPACT instance of the NHS England SDE. It provides
guidance on available datasets, their limitations and compilation methods,
curated assets that are refreshed quarterly, curated phenotypes, and tooling
resources (codelists, lookup/mapping tables, phenotype library workflows).

Dataset Overview

High-level inventory of all provisioned datasets in the NHS England SDE, with
coverage plots, table names/paths, and a summary table (PDF).

Dataset Insights

In-depth documentation for individual datasets: limitations, compilation
methods, coding systems used, and data quality considerations.

  • GDPPR: General Practice Extraction Service data; ~61m patients; SNOMED-CT coded; code cluster coverage explained
  • GDPPR SNOMED Subset Analysis: Analysis of GDPPR SNOMED coverage vs full SNOMED universe (3.7% of codes)
  • Emergency Care Data Set (ECDS): SNOMED-CT coded emergency department data; complete from April 2020
  • HES Critical Care (HES CC): ICU/HDU episode data; CCPERTYPE variable governs row structure
  • ICNARC: Intensive care national audit data for critically ill COVID-19 patients
  • NICOR TAVI: Transcatheter Aortic Valve Implantation registry; 24,685 individuals from 2018
  • Medicines Dispensed in Primary Care: Community pharmacy dispensing; SNOMED-CT DM+D and BNF coded; from April 2018
  • SSNAP: Sentinel Stroke National Audit Programme; near real-time stroke care data from 2018
  • SSNAP Collapsing Methodology: Five-step method to collapse multiple rows per stroke incidence to one row per person

Curated Data

Cleaned, tidied, and reformatted versions of raw datasets designed to simplify
downstream curation or analysis. Includes long-format procedure/diagnosis tables.

  • Curated Data Overview: Definition and goals of curated data; null removal, format standardisation, table restructuring
  • Patient IDs: NHS_NUMBER_DEID vs PERSON_ID_DEID; MPS derivation; token_pseudo_id_lookup table
  • HES APC (Curated Data): Parent page for HES APC curated data children
  • HES APC Operations: Long-format OPCS-4 procedure coding for HES APC; EPISTART and OPERTN_DATE fields
  • ECDS (Curated Data): Long-format ECDS diagnosis, investigation, and treatment tables; SNOMED-CT coded

Curated Assets

Processed datasets that combine multiple sources to extract specific variables
of interest. Refreshed quarterly and stored in the dsa_391419_j3w9t_collab
schema. Tables are loaded via PySpark using an archived date suffix (YYYY_MM_DD).

  • Curated Assets Overview: What curated assets are, versioning approach, and quarterly refresh cycle
  • Key Patient Characteristics (KPC): Standardised demographics (DOB, sex, ethnicity, LSOA); multisource and individual tables
  • KPC – How to Use: Table naming conventions and PySpark loading examples for demographics and KPC tables
  • KPC – Methodology: Selection algorithm across GDPPR, HES APC/OP/AE, SSNAP, Vaccine Status; tie-handling; ethnicity mapping
  • HES APC (Curated Asset): Parent page for HES APC curated asset children
  • HES APC Diagnosis: Long-format diagnosis table; ICD-9/ICD-10 codes per episode; 10-column structure
  • HES APC Procedures: Long-format OPCS-4 procedure table; three- and four-digit codes per episode
  • Deaths: Parent page for deaths curated asset children
  • Deaths – Single: One record per person from Civil Registration of Deaths; null person_id rows removed
  • Deaths – Cause of Death: Long-format ICD-10 cause of death; underlying and contributory causes; cleaned codes
  • Covid Positive: Consolidated COVID-19 positive records from antigen testing, GDPPR, and secondary care
  • LSOA Lookups: LSOA lookup asset (content coming soon)
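
The date-suffix convention noted for curated assets can be sketched as follows. This is a minimal illustration, not the team's documented loading code: the schema name comes from the text above, but the table stem `example_asset`, the `<stem>_<YYYY_MM_DD>` naming pattern, and both helper functions are assumptions made for the example. See the KPC "How to Use" page for the actual naming conventions.

```python
from datetime import date

# Schema name as stated in the Curated Assets description.
COLLAB_SCHEMA = "dsa_391419_j3w9t_collab"


def archived_table_name(stem: str, archived_on: date,
                        schema: str = COLLAB_SCHEMA) -> str:
    """Build '<schema>.<stem>_<YYYY_MM_DD>'.

    The '<stem>_<suffix>' pattern is an assumption for illustration;
    real table names may be constructed differently.
    """
    suffix = archived_on.strftime("%Y_%m_%d")
    return f"{schema}.{stem}_{suffix}"


def load_archived_table(spark, stem: str, archived_on: date):
    """Load one archived version using an existing SparkSession.

    `spark` is assumed to be the session already provided by the
    notebook environment; `spark.table` is the standard PySpark call
    for reading a named table.
    """
    return spark.table(archived_table_name(stem, archived_on))


# Hypothetical table stem, used only to show the resulting name.
print(archived_table_name("example_asset", date(2024, 1, 31)))
```

Pinning the archive date in the call, rather than reading an unsuffixed table, keeps an analysis tied to one quarterly refresh.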

Curated Phenotypes

Advanced curated assets that combine datasets, other curated assets, and
algorithms to define clinical phenotypes.

Resources

Tools and reference materials supporting researchers working with NHS electronic
health records in the SDE.

  • Resources Overview: Overview of tools developed to support EHR research in the NHS England SDE
  • Dataset Summary Dashboard: Interactive dashboard for data dictionaries, coverage, and completeness across SDE/SAIL/Scottish TRE
  • Codelist Comparison Tool: Web app for comparing codelists; integrates HDR UK Phenotype Library and OpenCodelists
  • Standard Pipeline: Reusable curation pipeline on GitHub for projects with common table/variable/coding requirements
  • Code Terminology: Parent page for clinical coding terminology assets
  • Lookup Tables: Lookup files for READ V2, CTV3, SNOMED-CT, ICD-10, OPCS-4, BNF; Box-hosted with R compilation code
  • Mapping Tables: One-directional terminology mapping files (READ V2→SNOMED, CTV3→SNOMED, ICD-9→ICD-10, etc.)
  • Other Resources: R packages Rdiagnosislist (SNOMED-CT in R) and clinconcept (concept dictionaries)
  • Phenotype Library Resources: HDR UK Phenotype Library overview; BHF DSC submission policy; API guidance
  • Codelist Formatting: R script to split master CSV into per-phenotype per-terminology files for library submission
  • Creating YAML Files: R script to generate phenotype YAML metadata files from Excel input for API upload
  • Batch Upload Phenotypes: R script using ConceptLibraryClient to batch-upload YAML phenotype definitions via API
  • Submission Instructions: Step-by-step guide for submitting phenotype definitions via the library interface or API

Useful Updates

Release notes and dataset change notices for quarterly provisioning batches.

Presentations

  • Presentations: PDF/PPTX slides on pseudonymised patient IDs, coding best practices, and defining study time periods in the SDE

How to Cite

  • How to Cite: Citation templates for documentation, Codelist Comparison Tool, Dataset Summary Dashboard, curated assets, and standard pipeline

@zwelshman zwelshman requested a review from fionnachalmers April 2, 2026 09:39
@zwelshman zwelshman self-assigned this Apr 2, 2026