
PhenoXtract

PhenoXtract is a configurable ETL (Extract-Transform-Load) pipeline and crate written in Rust for converting structured tabular data sources (CSV, Excel, and potentially others) into Phenopackets v2.0.

It provides a flexible, configuration-driven approach to map clinical cohort data into standardized, ontology-aware Phenopacket JSON objects, ready for downstream analysis, sharing, or storage. Configuration can be supplied in YAML, TOML, or JSON format.


Features

  • Extract

    • Supports CSV and Excel (.xlsx) files as input.
    • Handles flexible orientations: patients as rows or patients as columns (automatic transposition); a short sketch follows this feature list.
    • Automatic casting of column types:
      • bool (true/false)
      • int
      • float
      • date (YYYY-MM-DD, DD-MM-YYYY, MM/DD/YYYY, etc.)
      • datetime (ISO 8601, RFC 822, RFC 3339)
    • Regex and multi-column matching for identifiers.
    • Default or generated headers when missing.
  • Transform

    • Context-driven table interpretation (TableContext, SeriesContext).
    • Maps raw values into Phenopacket semantic fields:
      • Subject info (ID, sex, age, living status, weight, smoker, etc.)
      • Phenotypes (hpo_id, hpo_label, observation_status, onset)
    • Transformation strategies such as alias mapping, which maps cell values to other values (e.g. "M" -> "Male", "smoker" -> true, "neoplasma" -> 4), and strategies that look up HPO synonyms for cell values; see the alias_map in the configuration example below.
    • Integrated with the Human Phenotype Ontology (HPO) via the OBO Library and BioRegistry APIs (legacy GitHub registry deprecated).
    • Additional transformation strategies:
      • Multi-column HPO term expansion
      • Ontology normalization
      • Variant syntax parsing and phenopacket linting for schema validation
  • Load

    • Output Phenopackets (v2.0 JSON) to the filesystem (more loaders can be added later).
    • Output directory is configurable via the loader settings and defaults to the working directory.
  • Configurable

    • Single PhenoXtractConfig file (YAML/TOML/JSON/RON) defines:
      • Data sources (CSV/Excel).
      • Table contexts (how to interpret columns/rows).
      • Pipeline behavior (transformation strategies, loader).
      • Meta-data for the resulting phenopackets (created_by, submitted_by, cohort_name).
      • created_by is optional and defaults to "phenoxtract-{crate_version}".
  • Validation

    • Ensures configs are well-formed.
    • Validates data against expected schema contexts.
    • Automatic validation of configuration and data schemas before execution.
    • Includes consistency checks for table context, multi-series validation, and phenopacket conformance verification.
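
As noted in the Extract features above, a source whose patients are laid out one per column only needs patients_are_rows flipped; the extractor transposes the table before type casting. A minimal sketch (the file path and source name are illustrative, and the context block is omitted for brevity):

data_sources:
  - type: "csv"
    source: "./data/cohort_by_column.csv"
    separator: ","
    extraction_config:
      name: "patients_by_column"
      has_headers: true
      # patients appear as columns; the table is transposed during extraction
      patients_are_rows: false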

Configuration

The configuration file can be in YAML, TOML, or JSON format.

Example config.yaml

data_sources:
  - type: "csv"
    source: "./data/cohort.csv"
    separator: ","
    extraction_config:
      name: "patients"
      has_headers: true
      patients_are_rows: true
      context:
        name: "patient_table"
        context:
          - identifier: "patient_id"
            header_context: subject_id
            data_context: hpo_label
            alias_map:
              "M": "Male"
              "F": "Female"
              "smoker": true
              "neoplasma": 4
              "height": 1.85

pipeline_config:
  transform:
    strategies:
      - alias_mapping
      - fill_null
  loader:
    type: "file_system"

meta_data:
  # created_by is optional; defaults to "phenoxtract-{version}" if not provided
  submitted_by: "Dr. Example"
  cohort_name: "Example Cohort 2025"

This config defines:

  • One CSV data source.
  • Patients as rows, headers included.
  • Maps patient_id column into subject_id and hpo_label.
  • Applies transformation strategies (alias mapping, fill null).
  • Saves output phenopackets to disk using the file_system loader.
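
For illustration, the same example expressed in TOML; this is a mechanical translation that assumes the key structure is identical across formats:

[[data_sources]]
type = "csv"
source = "./data/cohort.csv"
separator = ","

[data_sources.extraction_config]
name = "patients"
has_headers = true
patients_are_rows = true

[data_sources.extraction_config.context]
name = "patient_table"

[[data_sources.extraction_config.context.context]]
identifier = "patient_id"
header_context = "subject_id"
data_context = "hpo_label"

[data_sources.extraction_config.context.context.alias_map]
"M" = "Male"
"F" = "Female"
"smoker" = true
"neoplasma" = 4
"height" = 1.85

[pipeline_config.transform]
strategies = ["alias_mapping", "fill_null"]

[pipeline_config.loader]
type = "file_system"

[meta_data]
# created_by is optional; defaults to "phenoxtract-{version}"
submitted_by = "Dr. Example"
cohort_name = "Example Cohort 2025"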

Getting Started

Prerequisites

  • Rust (stable toolchain recommended)
  • Cargo

Installation

Clone the repo and build:

git clone https://github.com/P2GX/phenoxtract.git
cd phenoxtract
cargo build --release
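
Optionally, since the crate ships a CLI, you can install the binary onto your PATH with Cargo:

cargo install --path .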

Running

The CLI is functional; you can execute the full ETL pipeline directly:

cargo run -- --config ./config.yaml

You can also use the crate as a library in Rust for integration or testing purposes:

use phenoxtract::config::phenoxtract_config::PhenoXtractConfig;

fn main() {
    let config = PhenoXtractConfig::load("config.yaml".into())
        .expect("Invalid configuration");
    if let Some(pipeline_config) = config.pipeline_config() {
        let pipeline = phenoxtract::pipeline::Pipeline::from_config(&pipeline_config)
            .expect("Failed to build pipeline");
        // Run extraction + transform + load
        // pipeline.run(&mut config.data_sources()).unwrap();
    }
}

Testing

The project includes extensive unit tests using rstest and tempfile. Tests cover:

  • Loading configs from all supported formats (YAML, TOML, JSON, RON).
  • Default and custom metadata.
  • Extraction from CSV/Excel in both row- and column-oriented layouts.
  • Auto-casting of datatypes.
  • Context and alias mapping validation.

Integration tests under /tests/ verify end-to-end ETL correctness and Phenopacket schema compliance, exercising the pipeline against minimal configs along these lines:

data_sources:
  - type: "csv"
    source: "./data/cohort.csv"
    separator: ","
    extraction_config:
      name: "patients"
      has_headers: true
      patients_are_rows: true
      context:
        name: "patient_table"
        context:
          - identifier: "patient_id"
            header_context: subject_id
            data_context: hpo_label
            alias_map:
              "M": "Male"
              "F": "Female"

pipeline_config:
  transform:
    strategies:
      - alias_mapping
      - fill_null
  loader:
    type: "file_system"

Run all tests with:

cargo nextest run --workspace --lib --all-targets --all-features
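
As a rough sketch of how such tests can be written with rstest (the fixture paths and test name below are illustrative, not taken from the repository), a parameterized config-loading test might look like:

use rstest::rstest;
use phenoxtract::config::phenoxtract_config::PhenoXtractConfig;

// Illustrative: load the same config from each supported format
#[rstest]
#[case("tests/fixtures/config.yaml")]
#[case("tests/fixtures/config.toml")]
#[case("tests/fixtures/config.json")]
fn config_loads_from_all_formats(#[case] path: &str) {
    let config = PhenoXtractConfig::load(path.into());
    assert!(config.is_ok());
}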

Output

  • Each patient/row in the input is transformed into a Phenopacket JSON object (v2.0 schema).
  • Metadata (created_by, submitted_by, cohort_name) is automatically included.
  • Files are written to the configured output directory (by default, the working directory).
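
A rough sketch of the shape of one output file (field names follow the GA4GH Phenopacket schema v2; the identifiers, HPO term, timestamp, and version strings are illustrative):

{
  "id": "patient-1",
  "subject": {
    "id": "patient-1",
    "sex": "MALE"
  },
  "phenotypicFeatures": [
    {
      "type": { "id": "HP:0001250", "label": "Seizure" }
    }
  ],
  "metaData": {
    "created": "2025-01-01T00:00:00Z",
    "createdBy": "phenoxtract-0.1.0",
    "submittedBy": "Dr. Example",
    "phenopacketSchemaVersion": "2.0"
  }
}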

Roadmap

  • Additional loaders (e.g., database, API).
  • Richer transformation strategies (beyond alias mapping and fill-null).
  • Expanded ontology support (HPO synonyms, MONDO, etc.).

Authors

  • Rouven Reuter
  • Patrick Simon Nairne
  • Peter Robinson
  • Varenya Jain

License

MIT - see LICENSE for details.
