This repository maintains an end to end downstream workflow (post FASTQ alignment) of bulk RNA sequencing analysis from gene read counts to functional enrichment analysis.
This workflow has been developed and tested on an Ubuntu 22.04 and 20.04 LTS machine so the individual R packages might not fully work on a Windows PC or on MacOS but users are more than welcome to try running it and take note of specific dependencies which might not be cross platform compatible (for possible alternatives) and make improvements to support other operating systems.
Ensure that you have renv installed as this will the main environment management tool used in this workflow. The packages used in the analysis can be easily obtained with
renv::restore()and this will download them by referencing therenv.lockfile.
For Windows: Some of the tools / packages in this analysis workflow might need to be built from source and the respective compilers for C/C++ & Fortran are nicely bundled within
Rtoolsfor Windows.Rtoolscan be installed by running the Rtools42 installer.
Although this is not a standalone package for bulk RNA-seq analysis but more of a pipeline that integrates multiple tools together, there are some helper functions that were written as R source code within the
/srcdirectory. In order to load them for use in the analysis, runsource(here::here("src", "RNA_seq_helper_functions.R"))at the start of each Rmd notebook to ensure the necessary helper functions are loaded too.
The read count tab data are expected as outputs from STAR with its
--quantMode GeneCountsflag turned on. As such, the format for the data would be a few_outputs_ReadsPerGene.out.tabfiles according to the number of RNA-seq samples. These should be placed in the/datafolder along with a tab delimited metadata text file which the user has to provide for DESeq2 to build it's deseq_object.
──RNA-seq-analysis
└──data
├──Wildtype_1_outputs_ReadsPerGene.out.tab
├──Wildtype_2_outputs_ReadsPerGene.out.tab
├──Wildtype_3_outputs_ReadsPerGene.out.tab
├──Wildtype_4_outputs_ReadsPerGene.out.tab
├──Gene_xyz_KO_1_outputs_ReadsPerGene.out.tab
├──Gene_xyz_KO_2_outputs_ReadsPerGene.out.tab
├──Gene_xyz_KO_3_outputs_ReadsPerGene.out.tab
├──Gene_xyz_KO_4_outputs_ReadsPerGene.out.tab
└──treatment_vs_control.meta
For the metadata file
treatment_vs_control.meta, ensure that the row-names match the sample names of the individual RNA-seq samples.
------------- ------------- -------------
| | sampletype | phenotype |
| ------------- | ------------- | ------------- |
| Wildtype_1 | Control | quiescent |
| ------------- | ------------- | ------------- |
| Wildtype_2 | Control | quiescent |
| ------------- | ------------- | ------------- |
| Wildtype_3 | Control | quiescent |
| ------------- | ------------- | ------------- |
| Wildtype_4 | Control | quiescent |
| ------------- | ------------- | ------------- |
| Gene_xyz_KO_1 | Treatment | proliferative |
| ------------- | ------------- | ------------- |
| Gene_xyz_KO_2 | Treatment | proliferative |
| ------------- | ------------- | ------------- |
| Gene_xyz_KO_3 | Treatment | proliferative |
| ------------- | ------------- | ------------- |
| Gene_xyz_KO_4 | Treatment | proliferative |
------------- ------------- -------------
Within the
config.ymlfile, ensure that the order of sample naming matches the metadata file. This would be the order that the read count data is read into R and merged into the countdata.frame. For the example metadata file above, the config file should look like:
Paths:
Treatment_vs_control_meta: "./data/treatment_vs_control.meta"
Treatment_vs_control_data: "./data"
Samples:
Treatment_vs_control:
Wildtype_1: "Wildtype_1"
Wildtype_2: "Wildtype_2"
Wildtype_3: "Wildtype_3"
Wildtype_4: "Wildtype_4"
Gene_xyz_KO_1: "Gene_xyz_KO_1"
Gene_xyz_KO_2: "Gene_xyz_KO_2"
Gene_xyz_KO_3: "Gene_xyz_KO_3"
Gene_xyz_KO_4: "Gene_xyz_KO_4"-
The RNA seq count data was obtained by running STAR aligner with
--quantMode GeneCountsenabled, which outputs the gene counts as a tab delimited file. -
Each file is then merged into a master data frame with its respective sample name as the column name.
-
The merged count data in a single data-frame is then compared against the
metadata.txtfile to ensure that its column names match the metedata row names. This is essential for DESeq2 to recognize the input dataframe as valid. -
Count distribution for each sample can also be plotted to QC checks on the read count distribution.
-
The trend of variance vs mean is also plotted to show the overdispersion phenomenon in RNA-seq count data.
-
An option for CPM filtering with edgeR is also included.
-
The size factors for DESeq2 normalisation are calculated during the initializtion of the DESeq2 object and the DESeq2 normalised counts can be written out to disk.
-
This step performs log transformation of the count data for PCA and other correlative visualisations.
-
Hierarchical clustering and a heatmap of pairwise correlation values of the gene wise counts per sample is performed to show how each sample relates to another. Ideally, samples within each condition should show the highest similarity to one another and most difference to those of the other condition.
-
The differential expression testing with DESeq2's fitting of a general linear model takes place in this stage with
RNA_seq_dds <- DESeq(RNA_seq_dds). The dispersion estimates can subsequently be visualised. -
The contrast is defined to calculate the Log2FoldChange with the appropriate direction (i.e Treatment vs Control) and based on the dispersion estimates, the Log2FoldChange is shrunk appropriately.
-
The MA plot can be visualised here as well as a quick summary of the differential expression results.
-
After setting a Log2FoldChange threshold, the differentially expressed genes can be exported as a CSV file.
- The significant differentially expressed genes can be visualised as a bar-plot, volcano plot and a heatmap.
-
Above is an example of a heatmap of significant differentially expressed genes.
This additional step post DESeq2 differential expression is mainly to accomplish these few objectives:
-
Filter out genes with less than a CPM of 0.2 and persist that data for downstream usage.
-
Annotate the filtered results with HGCN gene symbols, Entrez IDs & gene descriptions, and export as a csv.
-
Other filtering based on Log2FoldChange or Adjust p-value.
-
Export CPM edgeR normalized counts after CPM > 0.2 filter for GSEA.
The CPM filtered significant differentially expressed genes (sig_DEGs) will be piped into ClusterProfiler, a tool for performing functional enrichment analysis. ClusterProfiler houses a collection of tools and methods to perform a wholistic semantic analysis (drawing meaning from the data) of omics data by looking at global (and local) changes in the data and making relevant comparisons with publicly available curated omics and gene ontology databases.
A detailed walk-through of the major functionality of ClusterProfiler can be found here as well as the code using the tools within ClusterProfiler as separate Rmd files here.
For this workflow, we only use a number of the tools within ClusterProfiler that might be more commonly used in a typical RNA-seq analysis:
-
enrichGO()function with:- Biological Processes (BP)
- Molecular Function (MF)
- Cellular Component (CC)
-
Bar plots & dot plots of enrichment results for gene ontology over representation analysis (ORA).
-
A number of disease-gene enrichment:
enrichDGN()Disease Gene NetworkenrichNCG()Network of Cancer GenesenrichDO()Disease Ontology
-
Reactome pathway
enrichPathway()and KEGG pathwayenrichKEGG()over-representation analysis. -
Enrichment maps for gene sets from GO-BP, GO-MF & GO-CC.
-
More to come soon . . .
- All R scripts should be within
/srcand functions should be imported from there as to not clutter code. - No data should be committed to git and file paths for data needed to be imported into R should be included in the
config.yamlfile. If file paths contain sensitive information, they can be added to gitignore. - All analysis code should be in R markdown format as a
.rmdfile. Comments regarding analysis can be written as plain text in the R markdown file. - renv.lock files can be committed to git for package version tracking and restoring a new renv on another system.
Certain environment and user profile variables can be tweaked in the .Rprofile file but that file is gitignored for now.
Common issues can be tracked here along with their respective fixes:
- For the issue of not being able to fetch packages from Bioconductor, it might be attributed to a connection timeout. The fix for that is to add the
options(timeout=600)for a longer timeout duration of 10 minutes. Example
- Proper README and first PR.
- In depth explanation of each step of the analysis.
- Implement full functional enrichment analysis.
- Implement some version of Pathway Topology analysis (either NetGSA or SPIA or others if found).
- Comparative analysis with other pan cancer omics databases (e.g. TCGA).

