GP-SCV

The Gaussian Process Spatial CoVariance (GP-SCV) package is a Gaussian process model that uses kriging to predict values at unsampled locations from a known training set. The package treats one dimension as the position in a gene (or protein) sequence and the remaining dimensions as phenotypes linked to that gene or protein, with the aim of understanding the sequence-function-structure relationship at a fundamental level.
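To make the idea of kriging concrete, here is a minimal sketch of ordinary kriging at a single target location in base R. It is a generic textbook formulation, not the package's implementation: the function names (exp_cov, ordinary_krige), the exponential covariance model, the sill and range values, and the toy data are all assumptions chosen for illustration.

```r
# Exponential covariance model; sill and range are illustrative values.
exp_cov <- function(h, sill = 1, range = 2) sill * exp(-h / range)

# Ordinary kriging prediction at one location s0.
# coords: n x 2 matrix of sampled locations; z: values observed at those locations.
ordinary_krige <- function(coords, z, s0, sill = 1, range = 2) {
  n  <- nrow(coords)
  # Covariances between all pairs of sampled locations
  K  <- exp_cov(as.matrix(dist(coords)), sill, range)
  # Covariances between each sampled location and the target location
  k0 <- exp_cov(sqrt(colSums((t(coords) - s0)^2)), sill, range)
  # Kriging system with a Lagrange multiplier forcing the weights to sum to 1
  A   <- rbind(cbind(K, 1), c(rep(1, n), 0))
  sol <- solve(A, c(k0, 1))
  w   <- sol[1:n]
  mu  <- sol[n + 1]
  list(pred = sum(w * z),                  # weighted average of observed values
       var  = sill - sum(w * k0) - mu,     # kriging (prediction) variance
       weights = w)
}

# Toy training set (made-up numbers)
coords <- cbind(x = c(1, 2, 4, 5), y = c(1, 3, 2, 5))
z      <- c(0.2, 0.5, 0.9, 0.4)

res <- ordinary_krige(coords, z, s0 = c(3, 3))
```

In this formulation the prediction is a weighted average of the observed values, and the prediction variance shrinks to zero at a sampled location and grows as the target moves away from the training points.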

Installation

Minimal requirements:

- R (3.5 or above; see requirements.txt)
- SLURM scheduler

Installation steps:

- Clone the repository, or download and unzip it;
- Generate the input data (see Input);
- Run GP-SCV in batch mode.

Quick Start

Replace PHENOTYPE_1 and PHENOTYPE_2 with the phenotypes of interest on the y and z axes, and GENE with the gene or protein name:

  Rscript kriging.r PHENOTYPE_1 PHENOTYPE_2 GENE

Input

An input.RData workspace file is required containing the following variables:

--df : a data frame object in the format shown in the example table
--y_vars : a list of strings giving the column names in df of phenotypes that should be used as y variables
--z_vars : a list of strings giving the column names in df of phenotypes that should be used as z variables
--x_start : an integer giving the starting position for the sequence under consideration (e.g. 1 for the first amino acid in a protein sequence)
--x_end : an integer giving the end position for the sequence under consideration (e.g. n for a protein with n amino acids)
--x_name : a string giving the name of the column in df whose entries denote the residue position of each variant, a number between x_start and x_end (the column must not be named ‘x’)
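As a minimal sketch of how such a workspace might be assembled by hand, the following R snippet builds a toy data frame and saves the six variables listed above. The column names (position, phenotype_a, phenotype_b), all values, the use of character vectors for y_vars and z_vars, and the output location are illustrative assumptions; the package scripts remain the reference for the exact layout of df.

```r
# Toy variant table: one row per variant (all values are made up for illustration).
df <- data.frame(
  position    = c(12L, 45L, 45L, 103L, 250L),   # residue position of each variant
  phenotype_a = c(0.81, 0.42, 0.47, 0.10, 0.95),
  phenotype_b = c(1.9, 3.2, 3.0, 4.8, 1.1)
)

x_name  <- "position"        # column holding residue positions (must not be "x")
y_vars  <- c("phenotype_a")  # phenotype column(s) used as y variables
z_vars  <- c("phenotype_b")  # phenotype column(s) used as z variables
x_start <- 1L                # first position of the sequence
x_end   <- 300L              # last position of the sequence (assumed length)

# Basic consistency checks before saving
stopifnot(
  x_name != "x",
  all(c(x_name, y_vars, z_vars) %in% names(df)),
  all(df[[x_name]] >= x_start & df[[x_name]] <= x_end)
)

# Hypothetical location for this sketch; in practice the file goes in project_dir.
out_file <- file.path(tempdir(), "input.RData")
save(df, y_vars, z_vars, x_start, x_end, x_name, file = out_file)
```

Loading the file with load(out_file) in a fresh session should restore exactly these six objects.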

This file is generated automatically if you use the VCF data preparation pipeline. To run GP-SCV, call the script run_scv.sh with the following arguments, in order:

--project : the name of the specific project
--dateStamp : a date stamp as a unique identifier for each run (e.g. yyyymmdd or yyyymmdd-01 etc.)
--note : any note to be appended to the file names of the run
--project_dir : the directory where the input.RData file is located, and where the output directories and files will be saved.

For the VCF input file, see the VCF_input folder.

Output

The run produces the following files and directories.

Plots and figures :
- Hierarchical clustering map of all (y,z) variable pairs.
- For each (y,z) variable pair:
  (x,y) scatter plot,
  (y,z) scatter plot,
- Plot showing the sample and analytic variogram,
- Inverse variance weighted barcode plot,
- SCV interpolated landscape plot.
Metadata :
- CSV file containing the linear (y,z) correlation value (R) and the SCV prediction R value for all (x,y,z) combinations,
- Log file from the run,
- For each (y,z) variable pair:
  Bin size,
  Range/cut-off,
  Variogram formula.
Data files :
- Normalised input file with (x,y,z) sample data points (.csv),
- SCV prediction output file, predicting z across all (x,y) points in the landscape (.csv),
- SCV prediction variance output file, giving the variance of the z prediction across all (x,y) points in the landscape (.csv),
- IVW output file, giving the z prediction across all x points in the sequence (.csv).
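
The relationship between the prediction, variance, and IVW files can be illustrated with a generic inverse-variance weighting step. This is only a sketch under stated assumptions: it assumes the IVW value at each x is the average of the predicted z across y, weighted by the inverse of the prediction variance, which may differ from what the package actually computes. The matrix names (z_pred, z_var) and all numbers are made up.

```r
# Toy prediction grids: rows are x positions, columns are y values (made-up numbers).
z_pred <- matrix(c(1.0, 2.0, 3.0,
                   4.0, 4.0, 4.0), nrow = 2, byrow = TRUE)
z_var  <- matrix(c(0.5, 1.0, 2.0,
                   1.0, 1.0, 1.0), nrow = 2, byrow = TRUE)

# Inverse-variance weights: confident predictions (low variance) count for more.
w <- 1 / z_var

# One weighted-average z value per x position.
ivw_z <- rowSums(w * z_pred) / rowSums(w)
```

With equal variances the result reduces to the plain mean across y, as in the second row of the toy data.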
