PhenoXtract is a configurable ETL (Extract-Transform-Load) pipeline and crate written in Rust for converting structured tabular data sources (CSV, Excel, and potentially others) into **Phenopackets v2.0**.
It provides a flexible, configuration-driven approach to mapping clinical cohort data into standardized, ontology-aware *Phenopacket JSON objects*, ready for downstream analysis, sharing, or storage. Configuration can be supplied in *YAML, TOML, JSON, or RON* formats.
## Features

### Extract

- Supports CSV and Excel (`.xlsx`) files as input.
- Handles flexible orientations: patients as rows or patients as columns (automatic transposition).
- Automatic casting of column types (see the sketch after this list):
  - `bool` (true/false)
  - `int`
  - `float`
  - `date` (`YYYY-MM-DD`, `DD-MM-YYYY`, `MM/DD/YYYY`, etc.)
  - `datetime` (ISO 8601, RFC 822, RFC 3339)
- Regex and multi-column matching for identifiers.
- Default or generated headers when missing.
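The casting order matters: the most specific type is tried first, with plain text as the fallback. As an illustration only (not the crate's internal code), here is a minimal sketch of this kind of inference, assuming the `chrono` crate for date parsing and a reduced set of date layouts:

```rust
use chrono::NaiveDate;

/// Inferred cell types, mirroring the casts listed above.
#[derive(Debug, PartialEq)]
enum CellValue {
    Bool(bool),
    Int(i64),
    Float(f64),
    Date(NaiveDate),
    Text(String),
}

/// Try the most specific type first, then fall back to plain text.
fn infer_cell(raw: &str) -> CellValue {
    let s = raw.trim();
    if let Ok(b) = s.parse::<bool>() {
        return CellValue::Bool(b);
    }
    if let Ok(i) = s.parse::<i64>() {
        return CellValue::Int(i);
    }
    if let Ok(f) = s.parse::<f64>() {
        return CellValue::Float(f);
    }
    // A few of the date layouts listed above, tried in order.
    for fmt in ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y"] {
        if let Ok(d) = NaiveDate::parse_from_str(s, fmt) {
            return CellValue::Date(d);
        }
    }
    CellValue::Text(s.to_string())
}

fn main() {
    assert_eq!(infer_cell("true"), CellValue::Bool(true));
    assert_eq!(infer_cell("42"), CellValue::Int(42));
    assert_eq!(infer_cell("1.85"), CellValue::Float(1.85));
    assert_eq!(
        infer_cell("2025-01-31"),
        CellValue::Date(NaiveDate::from_ymd_opt(2025, 1, 31).unwrap())
    );
}
```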
### Transform

- Context-driven table interpretation (`TableContext`, `SeriesContext`).
- Maps raw values into Phenopacket semantic fields:
  - Subject info (ID, sex, age, living status, weight, smoker, etc.)
  - Phenotypes (`hpo_id`, `hpo_label`, `observation_status`, onset)
- Transformation strategies such as alias mapping, where cell values are mapped to other aliases (e.g. `"M" -> "Male"`, `"smoker" -> true`, `"neoplasma" -> 4`; see the sketch after this list), and strategies to find HPO synonyms of cell values.
- Integrated with the Human Phenotype Ontology (HPO) via the OBO Library and BioRegistry APIs (the legacy GitHub registry is deprecated).
- Additional transformation strategies:
  - Multi-column HPO term expansion
  - Ontology normalization
  - Variant syntax parsing and phenopacket linting for schema validation
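Conceptually, alias mapping is a per-column lookup applied to every cell, with unmapped values passed through unchanged. A minimal sketch of the idea (the real strategy is driven by the `alias_map` section of the configuration shown below):

```rust
use std::collections::HashMap;

fn main() {
    // Raw cell values mapped to canonical aliases, as in the examples above.
    let alias_map: HashMap<&str, &str> =
        HashMap::from([("M", "Male"), ("F", "Female")]);

    let column = ["M", "F", "M", "unknown"];
    let mapped: Vec<&str> = column
        .iter()
        .map(|&cell| alias_map.get(cell).copied().unwrap_or(cell))
        .collect();

    // Unmapped values ("unknown") pass through unchanged.
    assert_eq!(mapped, vec!["Male", "Female", "Male", "unknown"]);
}
```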
### Load

- Outputs Phenopackets (v2.0 JSON) to the filesystem; more loaders can be added later.
- The output directory is configurable via the loader settings and defaults to the working directory (see the sketch after this list).
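Conceptually, the filesystem loader serializes one JSON document per patient into the configured directory. A sketch under assumed names (`write_packets` is hypothetical, not the crate's loader API):

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper: write one JSON document per patient
/// into the configured output directory.
fn write_packets(out_dir: &Path, packets: &[(String, String)]) -> std::io::Result<()> {
    fs::create_dir_all(out_dir)?;
    for (patient_id, json) in packets {
        fs::write(out_dir.join(format!("{patient_id}.json")), json)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // One (patient_id, serialized phenopacket) pair per input row.
    let packets = vec![("patient_1".to_string(), "{}".to_string())];
    write_packets(Path::new("./out"), &packets)
}
```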
### Configurable

- A single `PhenoXtractConfig` file (YAML/TOML/JSON/RON; see the sketch after this list) defines:
  - Data sources (CSV/Excel).
  - Table contexts (how to interpret columns/rows).
  - Pipeline behavior (transformation strategies, loader).
  - Metadata for the resulting phenopackets (`created_by`, `submitted_by`, `cohort_name`); `created_by` is optional and defaults to `"phenoxtract-{crate_version}"`.
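Because every supported format deserializes into the same `PhenoXtractConfig`, switching formats only changes the file name. A sketch reusing the `load` call from the usage example below (the file names are hypothetical):

```rust
use phenoxtract::config::phenoxtract_config::PhenoXtractConfig;

fn main() {
    // Hypothetical file names: the same settings expressed in four formats.
    for path in ["config.yaml", "config.toml", "config.json", "config.ron"] {
        if PhenoXtractConfig::load(path.into()).is_ok() {
            println!("loaded {path}");
        }
    }
}
```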
### Validation

- Automatic validation of configuration and data schemas before execution, ensuring configs are well-formed.
- Validates data against the expected schema contexts.
- Consistency checks for table contexts, multi-series validation, and phenopacket conformance.
## Example Configuration

The configuration file can be written in YAML, TOML, JSON, or RON; the example below uses YAML.

```yaml
data_sources:
  - type: "csv"
    source: "./data/cohort.csv"
    separator: ","
    extraction_config:
      name: "patients"
      has_headers: true
      patients_are_rows: true
    context:
      name: "patient_table"
      context:
        - identifier: "patient_id"
          header_context: subject_id
          data_context: hpo_label
          alias_map:
            "M": "Male"
            "F": "Female"
            "smoker": true
            "neoplasma": 4
            "height": 1.85
pipeline_config:
  transform:
    strategies:
      - alias_mapping
      - fill_null
  loader:
    type: "file_system"
meta_data:
  # created_by is optional; defaults to "phenoxtract-{version}" if not provided
  submitted_by: "Dr. Example"
  cohort_name: "Example Cohort 2025"
```

This config defines:
- One CSV data source.
- Patients as rows, headers included.
- Maps the `patient_id` column into `subject_id` and `hpo_label`.
- Applies transformation strategies (alias mapping, fill null).
- Saves output phenopackets to disk using the `file_system` loader.
## Requirements

- Rust (stable toolchain recommended)
- Cargo

## Installation

Clone the repo and build:

```bash
git clone https://github.com/P2GX/phenoxtract.git
cd phenoxtract
cargo build --release
```

## Usage

The CLI is now functional -- you can execute the full ETL pipeline directly:

```bash
cargo run -- --config ./config.yaml
```

You can also use the crate as a library in Rust for integration or testing purposes:
```rust
use phenoxtract::config::phenoxtract_config::PhenoXtractConfig;

fn main() {
    // Load and validate the configuration file.
    let config = PhenoXtractConfig::load("config.yaml".into())
        .expect("Invalid configuration");

    // Build the pipeline from the optional pipeline section.
    if let Some(pipeline_config) = config.pipeline_config() {
        let pipeline = phenoxtract::pipeline::Pipeline::from_config(&pipeline_config)
            .expect("Failed to build pipeline");

        // Run extraction + transform + load:
        // pipeline.run(&mut config.data_sources()).unwrap();
    }
}
```

## Testing

The project includes extensive unit tests using `rstest` and `tempfile` (see the sketch after this list). Tests cover:
- Loading configs from all supported formats (YAML, TOML, JSON, RON).
- Default and custom metadata.
- Extraction from CSV/Excel in both row- and column-oriented layouts.
- Auto-casting of datatypes.
- Context and alias mapping validation.
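For readers unfamiliar with `rstest`, parameterized cases look roughly like this (an illustrative test, not one copied from the suite):

```rust
use rstest::rstest;

// Illustrative only: checks the bool auto-cast described earlier
// for several raw inputs in one parameterized test.
#[rstest]
#[case("true", true)]
#[case("false", false)]
fn casts_bool_cells(#[case] raw: &str, #[case] expected: bool) {
    assert_eq!(raw.parse::<bool>().unwrap(), expected);
}
```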
Integration tests under `/tests/` verify end-to-end ETL correctness and Phenopacket schema compliance, exercising configurations such as:
```yaml
data_sources:
  - type: "csv"
    source: "./data/cohort.csv"
    separator: ","
    extraction_config:
      name: "patients"
      has_headers: true
      patients_are_rows: true
    context:
      name: "patient_table"
      context:
        - identifier: "patient_id"
          header_context: subject_id
          data_context: hpo_label
          alias_map:
            "M": "Male"
            "F": "Female"
pipeline:
  transform_strategies:
    - "alias_map"
    - "fill_null"
  loader: "file_system"
```
Run all tests with:

```bash
cargo nextest run --workspace --lib --all-targets --all-features
```

## Output

- Each patient/row in the input is transformed into a Phenopacket JSON object (v2.0 schema); the sketch after this list shows the general shape.
- Metadata (`created_by`, `submitted_by`, `cohort_name`) is automatically included.
- Files are written to the configured output directory (default: the working directory).
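The exact contents depend on the input and configuration, but one emitted Phenopacket has the general v2.0 shape sketched below (built with `serde_json` purely for illustration; all field values are hypothetical):

```rust
use serde_json::json;

fn main() {
    // Hypothetical minimal output for one patient; real output carries
    // full metadata (resources, timestamps, schema version).
    let packet = json!({
        "id": "patient_1",
        "subject": { "id": "patient_1", "sex": "MALE" },
        "phenotypicFeatures": [
            { "type": { "id": "HP:0001250", "label": "Seizure" } }
        ],
        "metaData": {
            "createdBy": "phenoxtract-0.1.0",
            "submittedBy": "Dr. Example",
            "phenopacketSchemaVersion": "2.0"
        }
    });
    println!("{}", serde_json::to_string_pretty(&packet).unwrap());
}
```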
## Roadmap

- Additional loaders (e.g., database, API).
- Richer transformation strategies (beyond alias mapping and fill-null).
- Expanded ontology support (HPO synonyms, MONDO, etc.).
## Authors

- Rouven Reuter
- Patrick Simon Nairne
- Peter Robinson
- Varenya Jain
## License

MIT - see LICENSE for details.