Oncodrive3D ships with a plotting pipeline that turns the clustering results into analysis-ready visualizations and enriched tables. Its goal is threefold:
- help you interpret why a gene or residue achieved a significant 3D-clustering signal by overlaying structural/functional context,
- highlight diagnostic cues that reveal whether a signal looks biologically plausible or could be an artifact,
- provide downstream-analysis assets such as summary panels, per-gene tracks, volcano/log-odds association charts, and annotated CSVs to guide follow-up experiments or reporting.
To keep the runtime reasonable, structural and functional annotations are generated once per dataset via oncodrive3d build-annotations, then reused by oncodrive3d plot for any cohort.
This document describes the prerequisites, the intermediate files that are produced, and how to customize the plots.
- Datasets –
oncodrive3d build-datasets(orbuild-datasets --mane_only) must have been run already; the plotting stage readsdatasets/pdb_structures,seq_for_mut_prob.tsv, andconfidence.tsv. - Python environment – use the same environment (e.g.,
uvvirtualenv) that you rely on for the CLI. - PDB_Tool binary –
oncodrive3d build-annotationsinvokes the PDB_Tool executable namedPDB_Toolon$PATHto compute per-residue solvent accessibility and secondary structure. Install it or wrap it in a container accessible from the command line. - Internet access – required to download Pfam annotations, UniProt features, and (if
--ddg_diris not set) RaSP ΔΔG predictions. You can point--ddg_dirto precomputed RaSP files to skip the download step or to provide in-house predicted scores. - Disk space – annotation folders contain many files; keep several GB free.
Run once per dataset (or whenever you update AlphaFold structures):
oncodrive3d build-annotations \
--data_dir /path/to/datasets \
--output_dir /path/to/annotationsKey options:
--data_dir(required) – the Oncodrive3D datasets folder built viaoncodrive3d build-datasets; annotations pull AlphaFold PDBs and metadata from here.--output_dir– target directory for the annotation bundle (defaults toannotations/; cleaned at each run unless--yes).--ddg_dir– point to precomputed RaSP ΔΔG predictions to skip the download step (mandatory for mouse because only human RaSP files are hosted).--organism– select the species recipe (human or mouse) so the correct external resources are queried.--cores/--yes– control parallelism and disable confirmation prompts when overwriting an existingoutput_dir.
What happens internally (scripts/plotting/build_annotations.py and helpers):
- Cleanup – the target directory is emptied (except for
log/) unless--yesis provided. - Stability change (ΔΔG) – for human datasets the command downloads RaSP predictions (
scripts/plotting/stability_change.py). There is no public RaSP bundle for mouse, so mouse runs must supply their own predictions with--ddg_dirif ΔΔG tracks are desired. In all cases, whichever folder you point to is parsed into JSON maps{position: {ALT: ddg}}, averaging overlapping fragments. - PDB features – AlphaFold structures are decompressed and sent through
PDB_Tool, producing.featurefiles that are then parsed intopdb_tool_df.tsvwith residue-level secondary structure (SSE) and relative accessibility (pACC). - Pfam domains – Pfam coordinates are pulled from the Ensembl BioMart archive plus the Pfam ID database, merged with Oncodrive3D’s sequence metadata, and written to
pfam.tsv. - UniProt features – the EMBL-EBI Proteins API supplies DOMAIN/PTM/SITE/MOTIF/MEMBRANE annotations. They are normalized, merged with Pfam entries, and stored in
uniprot_feat.tsv.
After a successful run the annotation folder contains:
annotations/
├── pdb_tool_df.tsv
├── pfam.tsv
├── uniprot_feat.tsv
├── stability_change/
│ └── <UNIPROT>_ddg.json
└── log/
Keep this directory around—oncodrive3d plot expects all of these files.
Once you have:
- Gene-level results (
<cohort>.3d_clustering_genes.csv), - Residue-level results (
<cohort>.3d_clustering_pos.csv), - Processed mutations (
<cohort>.mutations.processed.tsv), - Missense probability dictionary (
<cohort>.miss_prob.processed.json), seq_for_mut_prob.tsv,- Built datasets (
datasets/) and annotations (annotations/),
call:
oncodrive3d plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--maf_path output/COHORT/COHORT.mutations.processed.tsv \
--miss_prob_path output/COHORT/COHORT.miss_prob.processed.json \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--datasets_dir /path/to/datasets \
--annotations_dir /path/to/annotations \
--output_dir plots/COHORT \
--cohort COHORT \
--maf_for_nonmiss_path original_input.maf \
--lst_gene_tracks miss_count,miss_prob,score,clusters,ddg,disorder,pacc,ptm,site,sse,pfamKey options:
--gene_result_path,--pos_result_path(required) – paths to<cohort>.3d_clustering_genes.csvand<cohort>.3d_clustering_pos.csvfrom the same run.--maf_path/--maf_for_nonmiss_path–--maf_pathmust point to the processed missense-only TSV created byoncodrive3d run;--maf_for_nonmiss_pathis optional and enables the non-missense track if you provide the original MAF.--miss_prob_path/--seq_df_path– these files (<cohort>.miss_prob.processed.jsonand<cohort>.seq_df.processed.tsv) must come from the same run; they carry the per-residue missense probabilities and mapping metadata needed for plotting.--datasets_dir/--annotations_dir(required) – the dataset folder used during the run and the annotation bundle produced bybuild-annotations; mismatched directories cause missing-track errors.--output_dir/--cohort– where plots/CSVs are written and which cohort label is printed on them.--fdr– pass the column name with BH-FDR–adjusted p-values to display corrected values in the plots; defaults to rawpvalue.--lst_summary_tracks/--lst_gene_tracks– comma-separated lists of track names. If you pass custom lists, match them with--lst_*_hratiosto redistribute vertical space.--genes/--c_genes_only/--max_n_genes– jointly control which genes are rendered. Use--genesfor explicit symbols,--c_genes_onlyto keep only significant hits, and--max_n_genesto cap the total.--volcano_top_n,--summary_fsize_*,--gene_fsize_*– resize summary/association plots;--volcano_top_nsets how many associations are annotated in the cohort-level volcano.--output_csv/--output_all_pos– write the annotated position-level CSV; include non-mutated residues only when--output_all_posis set.
During execution (scripts/plotting/plot.py):
- Results are filtered to genes requested by the user, and the corresponding entries are sliced out of the sequence/annotation tables.
- A summary plot shows per-gene mutation counts, cluster residues, and score distributions.
- Per-gene plots combine multiple tracks: observed vs expected mutation counts, missense probabilities, clustering scores, PAE/pLDDT, ΔΔG, Pfam/UniProt annotations, PTMs, membrane regions, motifs, etc. Tracks that are not available for a gene are automatically removed.
- Annotated tables are built by merging the positional results with disorder (pLDDT), PDB features, transcript metadata, and UniProt domains. They are saved as
<cohort>.3d_clustering_pos.annotated.csvplus<cohort>.uniprot_feat.tsv. - Association plots (optional) – see the “Association Analyses” section below for details on how the logistic-regression statistics are generated and visualized (volcano, per-gene volcano, and log-odds panels).
The optional association module quantifies how strongly specific annotations track with significant clusters:
- Input preparation – residues with non-zero missense probability inherit standardized predictors: structural metrics (pLDDT, ΔΔG, surface exposure), categorical features (Pfam/UniProt/PTM/motif dummies), and the expected missense probability itself.
- Univariate logistic regressions – for each gene and each predictor, the pipeline fits
logit(C ~ feature)whereCis the binary cluster label. This yields log-odds, standard errors, and raw p-values that are stored in<cohort>.logreg_result.tsv. - Visualization – the statistics above feed three plot types under
<cohort>.associations_plots/:- A cohort-wide volcano plot highlighting annotations with the most extreme log-odds and p-values.
- Per-gene mini volcano plots to inspect feature enrichments gene by gene.
- Log-odds strip charts with 95% confidence intervals to visualize effect sizes.
Only raw p-values are provided; apply your preferred multiple-testing correction (e.g., BH-FDR) before drawing conclusions about specific features.
For interactive-ready 3D views, Oncodrive3D exposes a separate oncodrive3d chimerax-plot command. It takes the same gene/position-level CSVs produced by oncodrive3d run, plus the datasets directory (for AlphaFold structures) and the processed sequence dataframe, and renders PNG snapshots together with .defattr files under <output_dir>/<cohort>.chimerax/. Each snapshot shows the AlphaFold model with significant clusters highlighted, optional extended clusters, pLDDT coloring, and sample counts. Provide the path to your ChimeraX installation via --chimerax_bin or rely on the default /usr/bin/chimerax. The Nextflow pipeline exposes the same functionality through the chimerax_plot flag.
Example:
oncodrive3d chimerax-plot \
--gene_result_path output/COHORT/COHORT.3d_clustering_genes.csv \
--pos_result_path output/COHORT/COHORT.3d_clustering_pos.csv \
--datasets_dir /path/to/datasets \
--seq_df_path output/COHORT/COHORT.seq_df.processed.tsv \
--output_dir plots/COHORT \
--cohort COHORT \
--chimerax_bin /opt/ChimeraX/bin/ChimeraX \
--max_n_genes 20 \
--pixel_size 0.1 \
--cluster_ext--gene_result_path,--pos_result_path(required) – gene- and residue-level CSVs from the same run; both are needed to highlight clusters.--seq_df_path– must point to the processed sequence dataframe (<cohort>.seq_df.processed.tsv), which provides transcript IDs, fragments, and mappings needed to overlay residues on structures.--datasets_dir/--output_dir– dataset directory supplying AlphaFold models to load in ChimeraX and the destination for PNG/attribute outputs.--cohort/--max_n_genes– label snapshots and cap the number of genes displayed.--chimerax_bin– path to the ChimeraX executable; override this if ChimeraX is installed outside/usr/bin/chimerax.--pixel_size– controls image resolution (smaller values produce larger images); default0.08.--cluster_ext/--fragmented_proteins– toggles to display extended clusters (mutations that help rescue significance) and to keep AlphaFold-fragmented proteins, respectively.--transparent_bg– render PNGs with a transparent background (handy for figure overlays).--af_version– specify which AlphaFold build (release tag) your datasets use so the script picks the right structure names (default4).
- PDB_Tool missing – install the binary or adjust
$PATH. The build step logs the exact command being executed, making it easier to diagnose permission issues. - Annotation mismatches – plots rely on UniProt IDs matching between the run outputs and the annotation tables. Make sure you pass the same datasets directory used during
build-datasets. - Association plots without data – if a gene lacks both clustered and non-clustered residues after filtering, it is skipped from the logistic regression. The log file will note “There aren’t any relationship to plot”.