balchlab/scv

GP-SCV

The Gaussian Process Spatial CoVariance (GP-SCV) package implements a Gaussian process (kriging) model that predicts values at unsampled locations from a known training set. It is designed so that one dimension is a gene (or protein) sequence and the remaining dimensions are phenotypes linked to that gene or protein, with the aim of understanding the sequence-function-structure relationship at a fundamental level.

Installation

Minimal requirements:

  • R (3.5 and above; see requirements.txt)
  • SLURM scheduler

Steps:

  • Clone the repository (unzip it if it was downloaded as an archive);
  • Generate the input data (see Input);
  • Run GP-SCV in batch mode.

Quick Start

Replace PHENOTYPE_1 and PHENOTYPE_2 with the phenotypes of interest on the y and z axes, and GENE with the gene or protein name:

  Rscript kriging.r PHENOTYPE_1 PHENOTYPE_2 GENE
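As a minimal sketch of the substitution described above, the snippet below fills in the three placeholders and prints the resulting command before anything is run. The values phenotype_a, phenotype_b and GENE1, the helper run_or_print, and the RUN switch are all hypothetical and are not part of GP-SCV; only the `Rscript kriging.r` call and its argument order come from this README.

```shell
# Hypothetical placeholder values; substitute your own phenotype and gene names.
PHENOTYPE_1="phenotype_a"   # phenotype on the y axis
PHENOTYPE_2="phenotype_b"   # phenotype on the z axis
GENE="GENE1"                # gene or protein name

# Helper (not part of GP-SCV): print the command by default,
# and execute it only when RUN=1 is set in the environment.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

run_or_print Rscript kriging.r "$PHENOTYPE_1" "$PHENOTYPE_2" "$GENE"
```

Printing first makes it easy to check the argument order (y phenotype, z phenotype, gene) before a long run is started.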

Input

An input.RData workspace file is required containing the following variables:

  • --df : a data frame object in the format shown in the example table
  • --y_vars : a list of strings giving the column names in df of phenotypes that should be used as y variables
  • --z_vars : a list of strings giving the column names in df of phenotypes that should be used as z variables
  • --x_start : an integer giving the starting position for the sequence under consideration (e.g. 1 for the first amino acid in a protein sequence)
  • --x_end : an integer giving the end position for the sequence under consideration (e.g. n for a protein with n amino acids)
  • --x_name : a string giving the name of the column in df that holds the residue position of each variant, a number from 1 to x_length (the column must not be named ‘x’)

This file is generated automatically when the VCF data preparation pipeline is used. To run GP-SCV, call the script run_scv.sh with the following arguments, in order:

  • --project : the name of the specific project
  • --dateStamp : a date stamp as a unique identifier for each run (e.g. yyyymmdd or yyyymmdd-01 etc.)
  • --note : any note to be appended to the file names of the run
  • --project_dir : the directory where the input.RData file is located, and where the output directories and files will be saved.
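
The argument order above can be sketched as follows. This assumes the four values are passed positionally, which is how "in order" reads here; the README does not say how the script is launched (for example directly or through the SLURM scheduler), so the snippet only assembles and prints the argument list. All values shown are hypothetical.

```shell
# Hypothetical values; replace with your own.
PROJECT="my_project"               # project name
DATE_STAMP="$(date +%Y%m%d)"       # yyyymmdd date stamp identifying this run
NOTE="trial"                       # note added to the output file names
PROJECT_DIR="$PWD/my_project"      # directory that holds input.RData

# Warn early if the required workspace file is missing.
if [ ! -f "$PROJECT_DIR/input.RData" ]; then
  echo "warning: no input.RData found in $PROJECT_DIR" >&2
fi

# Assemble the arguments in the documented order and print them for review.
CMD="run_scv.sh $PROJECT $DATE_STAMP $NOTE $PROJECT_DIR"
echo "$CMD"
```

If several runs are made on the same day, a suffix such as -01 can be added to the date stamp, as noted in the argument list.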

For VCF input files, see the VCF_input folder.

Output

The run produces the following files and directories.

  • Plots and figures :
    • Hierarchical clustering map of all (y,z) variable pairs.
    • For each (y,z) variable pair:
       (x,y) scatter plot,
       (y,z) scatter plot,
    • Plot showing the sample and analytic variogram,
    • Inverse variance weighted barcode plot,
    • SCV interpolated landscape plot.
  • Metadata :
    • CSV file containing the linear (y,z) correlation value (R) and the SCV prediction R value for all (x,y,z) combinations,
    • Log file from the run,
    • For each (y,z) variable pair:
       Bin size,
       Range/cut-off,
       Variogram formula.
  • Data files :
    • Normalised input file with (x,y,z) sample data points (.csv),
    • SCV prediction output file, predicting z across all (x,y) points in the landscape (.csv),
    • SCV prediction variance output file, giving the variance of the z prediction across all (x,y) points in the landscape (.csv),
    • IVW output file, giving the z prediction across all x points in the sequence (.csv).

About

Spatial covariance package
