The Gaussian Process Spatial CoVariance (GP-SCV) package implements a Gaussian process (Kriging) model for spatial prediction at unsampled locations from a known training set. The package is designed so that one dimension is a gene (or protein) sequence and the remaining dimensions are phenotypes linked to that gene or protein, with the aim of understanding the sequence-function-structure relationship at a fundamental level.
Minimal requirements:
- R (version 3.5 or above; see requirements.txt)
- SLURM scheduler
Quick start:
- Clone the repository and unzip it;
- Generate the input data (see Input data);
- Run GP-SCV in batch mode.
Replace PHENOTYPE_1 and PHENOTYPE_2 with the phenotypes of interest on the y and z axes, and GENE with the gene or protein name:
Rscript kriging.r PHENOTYPE_1 PHENOTYPE_2 GENE
An input.RData workspace file is required containing the following variables:
- --df : a data frame object in the format shown in example table
- --y_vars : a list of strings giving the column names in df of phenotypes that should be used as y variables
- --z_vars : a list of strings giving the column names in df of phenotypes that should be used as z variables
- --x_start : an integer giving the starting position for the sequence under consideration (e.g. 1 for the first amino acid in a protein sequence)
- --x_end : an integer giving the end position for the sequence under consideration (e.g. n for a protein with n amino acids)
- --x_name : a string giving the name of the column in df whose entries denote the residue position of each variant, a number from 1 to x_length (the column name itself should not be ‘x’)
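As an illustration of the variables listed above, the following is a minimal sketch of how an input.RData file could be assembled by hand. The column names (`position`, `phenotype_a`, `phenotype_b`), the toy values, and the use of a temporary directory are all assumptions made for the example; the actual layout of df should follow the example table referred to above.

```r
# Hypothetical toy data: one row per variant (column names are illustrative only).
df <- data.frame(
  position    = c(12L, 45L, 45L, 101L, 230L),   # residue position of each variant
  phenotype_a = c(0.82, 0.15, 0.33, 0.97, 0.40),
  phenotype_b = c(1.4, 0.2, 0.5, 1.9, 0.7),
  stringsAsFactors = FALSE
)

y_vars  <- list("phenotype_a")   # columns of df used as y variables
z_vars  <- list("phenotype_b")   # columns of df used as z variables
x_start <- 1L                    # first position of the sequence
x_end   <- 250L                  # last position of the sequence
x_name  <- "position"            # column holding the residue position (not "x")

# Basic consistency checks before saving.
stopifnot(
  x_name != "x",
  x_name %in% names(df),
  all(unlist(c(y_vars, z_vars)) %in% names(df)),
  all(df[[x_name]] >= x_start & df[[x_name]] <= x_end)
)

# Assumption: a temporary directory stands in for the real project directory.
project_dir <- tempdir()
save(df, y_vars, z_vars, x_start, x_end, x_name,
     file = file.path(project_dir, "input.RData"))
```

Loading the saved file with `load()` should restore exactly these six objects into the workspace.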
This file is generated automatically when the VCF data preparation pipeline is used. To run GP-SCV, call the script run_scv.sh with the following arguments, in order:
- --project : the name of the specific project
- --dateStamp : a date stamp as a unique identifier for each run (e.g. yyyymmdd or yyyymmdd-01 etc.)
- --note : any note to be appended to the file names of the run
- --project_dir : the directory where the input.RData file is located, and where the output directories and files will be saved.
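The argument order above can be sketched as follows. This only assembles and prints the command line; the project name, note, and directory are hypothetical placeholders, and how the script is actually launched (for example directly or through the SLURM scheduler) depends on the local setup.

```r
# Hypothetical values for the four positional arguments, in the documented order.
project     <- "my_project"
date_stamp  <- format(Sys.Date(), "%Y%m%d")   # yyyymmdd date stamp
note        <- "test"
project_dir <- "/path/to/project"             # placeholder: must contain input.RData

args <- c(project, date_stamp, note, project_dir)

# Quote each argument for the shell and build the command string.
cmd <- paste("run_scv.sh", paste(shQuote(args), collapse = " "))
cat(cmd, "\n")
```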
For the VCF input file, see the VCF_input folder.
The following files and directories are produced during the run:
- Plots and figures :
  - Hierarchical clustering map of all (y,z) variable pairs.
  - For each (y,z) variable pair:
    - (x,y) scatter plot,
    - (y,z) scatter plot,
    - Plot showing the sample and analytic variogram,
    - Inverse variance weighted barcode plot,
    - SCV interpolated landscape plot.
- Metadata :
  - csv file containing details of linear (y,z) R correlation value and SCV prediction R value for all (x,y,z) combinations,
  - Log file from the code run.
  - For each (y,z) variable pair:
    - Bin size,
    - Range/cut-off,
    - Variogram formula.
- Data files :
  - Normalised input file with (x,y,z) sample data points (.csv),
  - SCV prediction output file, predicting z across all (x,y) points in the landscape (.csv),
  - SCV prediction variance output file, giving the variance of the z prediction across all (x,y) points in the landscape (.csv),
  - IVW output file, giving the z prediction across all x points in the sequence (.csv).
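Two of the quantities named in the outputs above, the sample variogram and the inverse variance weighted (IVW) prediction, can be illustrated with a short base R sketch. This is not the package's own code: the function names `sample_variogram` and `ivw_by_x`, their arguments, and the toy data are assumptions, and the formulas are the textbook ones (binned semivariance of point pairs, and a weighted mean with weights 1/variance).

```r
# Sample (empirical) semivariogram: bin point pairs by separation distance
# and average 0.5 * (z_i - z_j)^2 within each bin.
# `bin_size` and `cutoff` are hypothetical argument names.
sample_variogram <- function(coords, z, bin_size, cutoff) {
  d    <- as.matrix(dist(coords))        # pairwise Euclidean distances
  g    <- 0.5 * outer(z, z, "-")^2       # pairwise semivariances
  keep <- upper.tri(d) & d <= cutoff     # count each pair once, within the cut-off
  bin  <- floor(d[keep] / bin_size) + 1  # 1-based distance bin index
  data.frame(
    lag   = as.numeric(tapply(d[keep], bin, mean)),    # mean distance per bin
    gamma = as.numeric(tapply(g[keep], bin, mean)),    # mean semivariance per bin
    n     = as.numeric(tapply(d[keep], bin, length))   # number of pairs per bin
  )
}

# IVW summary of a prediction grid: for each x position (row), average the
# predicted z over the y grid (columns) with weights 1 / prediction variance.
ivw_by_x <- function(z_hat, z_var) {
  w <- 1 / z_var
  rowSums(w * z_hat) / rowSums(w)
}

# Toy sample data (made up for illustration).
set.seed(1)
coords <- cbind(x = runif(50, 1, 100), y = rnorm(50))
z      <- sin(coords[, "x"] / 15) + rnorm(50, sd = 0.1)
vg     <- sample_variogram(coords, z, bin_size = 5, cutoff = 40)

# Toy 2 x 2 prediction grid: rows are x positions, columns are y grid values.
z_hat <- matrix(c(1, 3,
                  2, 6), nrow = 2, byrow = TRUE)
z_var <- matrix(c(1, 1,
                  1, 4), nrow = 2, byrow = TRUE)
ivw <- ivw_by_x(z_hat, z_var)
```

In the toy grid, the first row has equal variances, so its IVW value is the plain mean; in the second row the higher-variance prediction is down-weighted, pulling the result towards the more certain value.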