Skip to content

UoSDDNB/PathPinpointR

Repository files navigation

PathPinpointR

GitHub R package version

PathPinpointR identifies the position of a sample upon a trajectory.

Assumptions:
  • Sample is found upon the chosen trajectory.
  • Sample is from a distinct part of the trajectory. A sample with cells that are evenly distributed across the trajectory will have a predicted location at the centre of the trajectory.

Example Workflow

This vignette will take you through the basics running PPR. The data used here is an integrated data-set of blastocyst data.

Installation

Option 1: Install using Conda (Recommended)

The easiest way to set up PathPinpointR is using conda, which automatically manages all dependencies.

  1. Create the conda environment from the provided environment.yml file:
conda env create -f environment.yml
  1. Activate the environment:
conda activate pathpinpointr
  1. Install additional Bioconductor packages (not available via conda):
Rscript -e "BiocManager::install('DelayedMatrixStats')"
  1. Install GeneSwitches (not available via conda, must be installed from GitHub):
Rscript -e "devtools::install_github('SGDDNB/GeneSwitches')"
  1. Install PathPinpointR:
# From GitHub (recommended)
Rscript -e "devtools::install_github('UoSDDNB/PathPinpointR')"

# Or from local directory if you have the source code
Rscript -e "devtools::install('.')"

The conda environment includes: - R (version 4.0+) - All CRAN dependencies (ggplot2, ggrepel, fastglm, plyr, RColorBrewer, ggridges, gridExtra, mixtools, vioplot, devtools, rmarkdown, BiocManager) - All Bioconductor dependencies (Biobase, SingleCellExperiment, SummarizedExperiment, slingshot, monocle) - Seurat (via conda-forge) - Build tools (gcc, gfortran, make)

Option 2: Manual Installation

Install required packages

Run the following code to load all packages neccecary for PPR & this vignette.

required_packages <- c("SingleCellExperiment", "Biobase", "fastglm", "ggplot2",
                       "monocle", "plyr", "RColorBrewer", "ggrepel", "ggridges",
                       "gridExtra", "devtools", "mixtools", "Seurat",
                       "parallel", "RColorBrewer", "slingshot",
                       "DelayedMatrixStats")

# for packages installed from CRAN
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# for packages installed from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) BiocManager::install(new_packages)

# for package "GeneSwitches"
devtools::install_github("SGDDNB/GeneSwitches")

Rtools is required to install PathPinpointR and GeneSwitches from source on Windows. You can download it here: https://cran.r-project.org/bin/windows/Rtools

install PathPinpointR

You can install the development version of PathPinpointR using:

devtools::install_github("UoSDDNB/PathPinpointR")

Load the required packages

library(PathPinpointR)
library(Seurat)
library(ggplot2)
library(SingleCellExperiment)
library(slingshot)
library(RColorBrewer)
library(GeneSwitches)

Load the reference data

The reference dataset is a Seurat object of a blastocyst dataset. The data is an integrated dataset of 8 studies. The data has been downsampled and filtered to only inlcude ebiblast cells.

# download the example data
get_example_data()

# Load the reference data to the environment
reference_seu <- readRDS("./reference.rds")
# Load the sample data to the environment
samples_seu <- lapply(c("./LW120.rds", "./LW122.rds"), readRDS)
# name the samples in the list
names(samples_seu) <- c("LW120", "LW122")

View the reference UMAP plot

Plot the reference data, colored by the day of development.

DimPlot(object = reference_seu,
        reduction = "umap",
        group.by = "time",
        label = TRUE) +
  ggtitle("Reference")

The plot shows the development of the epiblast cells. Some labels are a mix of samples from different days.

Convert to SingleCellExperiment objects.

Prior to running slingshot and GeneSwitches, we need to convert the Seurat objects to SingleCellExperiment objects.

reference_sce    <- SingleCellExperiment(assays = list(expdata = reference_seu@assays$RNA$counts))
colData(reference_sce) <- DataFrame(reference_seu@meta.data)
reducedDims(reference_sce)$UMAP <- reference_seu@reductions$umap@cell.embeddings

# create an empty list to to store the sce sample objects
samples_sce <- list()
# Iterate through each Seurat object in the samples list
for (i in seq_along(samples_seu)){
  # convert each sample to a SingleCellExperiment object & store in the list
  samples_sce[[i]] <- SingleCellExperiment(assays = list(expdata = samples_seu[[i]]@assays$RNA$counts))
}
# carry over the sample names from the Seurat objects.
names(samples_sce) <- names(samples_seu)

Run slingshot

Run slingshot on the reference data to produce pseudotime for each cell.

reference_sce  <- slingshot(reference_sce,
                            clusterLabels = "seurat_clusters",
                            start.clus  = "2",
                            end.clus = "1",
                            reducedDim = "UMAP")

#Rename the Pseudotime column to work with GeneSwitches
colData(reference_sce)$Pseudotime <- reference_sce$slingPseudotime_1

Plot the slingshot trajectory.

The plot shows the trajectory of the blastocyst data, with cells colored by pseudotime.

# Generate colors
colors <- colorRampPalette(brewer.pal(11, "Spectral")[-6])(100)
plotcol <- colors[cut(reference_sce$slingPseudotime_1, breaks = 100)]
# Plot the data
plot(reducedDims(reference_sce)$UMAP, col = plotcol, pch = 16, asp = 1)
lines(SlingshotDataSet(reference_sce), lwd = 2, col = "black")

Binarize the Gene Expression Data

Using the package GeneSwitches, binarize the gene expression data of the reference and query data-sets, with a cutoff of 1.

# binarize the expression data of the reference
reference_sce <- binarize_exp(reference_sce,
                              fix_cutoff = TRUE,
                              binarize_cutoff = 1,
                              ncores = 1)

# Find the switching point of each gene in the reference data
reference_sce <- find_switch_logistic_fastglm(reference_sce,
                                              downsample = TRUE,
                                              show_warning = FALSE)

Note: both binatize_exp() and find_switch_logistic_fastglm(), are time consuming processes and may take tens of minutes to run.

Visualise the switching genes

generate a list of switching genes, and visualise them on a pseudo timeline.

switching_genes <- filter_switchgenes(reference_sce,
                                      allgenes = TRUE,
                                      r2cutoff = 0)

# Plot the timeline using plot_timeline_ggplot
plot_timeline_ggplot(switching_genes,
                     timedata = colData(reference_sce)$Pseudotime,
                     txtsize = 3)

Indervidual switching genes can be visualised using Seurat’s FeaturePlot function.

FeaturePlot(reference_seu, features = switching_genes$geneID[1], reduction = "umap", min.cutoff = "q10", max.cutoff = "q90") + ggtitle(switching_genes$geneID[1])

Select a number of switching genes

Using the PPR function precision():

precision(reference_sce)

Narrow down the search to find the optimum number of switching genes.

precision(reference_sce, n_sg_range = seq(50, 150, 1))

Filter and re-visualise the switching genes.

switching_genes <- filter_switchgenes(reference_sce,
                                      allgenes = TRUE,
                                      r2cutoff = 0,
                                      topnum = 114)

Note: The number of switching genes significantly affects the accuracy of PPR.
too many will reduce the accuracy by including uninformative genes/noise.
too few will reduce the accuracy by excluding informative genes.
The using precision(), 114 is found to be the optimum for this data.

# Plot the timeline using plot_timeline_ggplot
plot_timeline_ggplot(switching_genes,
                     timedata = colData(reference_sce)$Pseudotime,
                     txtsize = 3)

Measure accuracy

We can calculate the accuracy of PPR in the given trajectory by comparing the predicted position of the reference cells to their pseudotimes defined by slingshot. the accuracy can be visualised using acuracy_test.

# reduce the reference data to only include the switching genes.
reference_reduced_sce <- reduce_counts_matrix(reference_sce, switching_genes)
# predict the position of cells in the reference trajectory.
reference_ppr <- predict_position(reference_reduced_sce, switching_genes)

# plot the accuracy of the prediction
accuracy_test(reference_ppr, reference_sce, plot = TRUE)

Binarize the sample data

Binarize the gene expression data of the samples.

# First reduce the sample data to only include the switching genes.
samples_sce <- lapply(samples_sce, reduce_counts_matrix, switching_genes)

# binarize the expression data of the samples
samples_binarized <- lapply(samples_sce,
                            binarize_exp,
                            fix_cutoff = TRUE,
                            binarize_cutoff = 1,
                            ncores = 1)

Predict Position

Produce an estimate for the position of each cell in each sample. The prediction is stored as a PPR_OBJECT.

# Iterate through each Seurat object in the predicting their positons,
# on the reference trajectory, using PathPinpointR.
samples_ppr <- lapply(samples_binarized, predict_position, switching_genes)

Plotting the predicted position of each sample:

plot the predicted position of each sample on the reference trajectory.

# show the predicted position of the first sample
# include the position of cells in the reference data, by a given label.
ppr_plot() +
  reference_idents(reference_sce, "time") +
  sample_prediction(samples_ppr[[1]], label = "Sample 1", col = "red")

# show the predicted position of the second sample
sample_prediction(samples_ppr[[2]], label = "Sample 2", col = "blue")

# show the points at which selected genes switch.
switching_times(c("TKTL1", "CYYR1", "KHDC1L"), switching_genes)

A simpler plot of your results can be generated with the function ppr_vioplot().

ppr_vioplot(samples_ppr, reference_sce, ident = "time")

About

A method of identifying the position of a sample upon a trajectory.

Resources

License

Unknown, GPL-3.0 licenses found

Licenses found

Unknown
LICENSE
GPL-3.0
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Contributors

Languages