GitHub - tanglab/HARE: Harmonizing genetic ancestry and self-identified race/ethnicity

#-------------------------------------------------------------------------------
This directory contains bash and R scripts (the R package "e1071" is required) for training the HARE model and assigning HARE, as described in Fang et al. (in preparation). The scripts assume the LSF batch job system. For other job systems, the job submission commands in the bash scripts must be modified, as must "LSB_JOBNAME" on Line 23 and "LSB_JOBINDEX" on Line 25 of "svm_fold.sh", to match the job management system in use.

Contact: Huaying Fang (hyfang@stanford.edu)
Date: 20190213
#-------------------------------------------------------------------------------
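For job systems other than LSF, the change to the job name and job index variables might look roughly like the sketch below. It reads the LSF variables when they are set and otherwise falls back to the SLURM equivalents. The helper names job_name and job_index are hypothetical and are not part of svm_fold.sh; only the environment variable names come from the schedulers.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the HARE scripts): resolve the job name
# and array index under LSF, falling back to SLURM, then to a default.

job_name() {
    # LSF sets LSB_JOBNAME; SLURM sets SLURM_JOB_NAME.
    echo "${LSB_JOBNAME:-${SLURM_JOB_NAME:-unknown}}"
}

job_index() {
    # LSF sets LSB_JOBINDEX for array jobs; SLURM sets SLURM_ARRAY_TASK_ID.
    echo "${LSB_JOBINDEX:-${SLURM_ARRAY_TASK_ID:-1}}"
}

echo "job name:  $(job_name)"
echo "job index: $(job_index)"
```

The defaults ("unknown" and 1) are arbitrary choices so that the snippet also runs outside a scheduler.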
hare_batch.sh: This is the main driver shell script. It outlines a five-step procedure that selects the SVM tuning parameters through cross-validation and then assigns HARE. The five steps must be run sequentially; steps 2 and 4 each spawn multiple jobs.

/********Input Data********/
The directory "input" contains example data. The input files are (1) the self-identified race/ethnicity (SIRE) file "pop1_sire.txt" and (2) the principal components (PC) file "pop1_30pcs.txt". See input/ for examples. In the SIRE file, individuals with missing or inconsistent SIRE should be coded as NA. Both the SIRE and PC files must contain a column named "IID".
These two files are used by demo.R to generate an R data file "input_sirePC.rdat", which contains a data.frame object "data_svm" used in the subsequent steps.
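To illustrate why both files need an "IID" column, the sketch below joins a SIRE file and a PC file on IID. It is only an illustration under assumptions: demo.R is not reproduced here, the files are assumed to be whitespace-delimited with a header line, and the function name merge_by_iid and the toy rows (including the label "EUR") are made up.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: join a SIRE file and a PC file on their IID column.
# Assumes whitespace-delimited files with a header line.

merge_by_iid() {
    # $1 = SIRE file, $2 = PC file; prints merged rows, IID first.
    awk '
        FNR == 1 {
            # Locate the IID column from the header of each file.
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "IID") col = i
            if (col == 0) { print "no IID column in " FILENAME > "/dev/stderr"; exit 1 }
        }
        {
            # Collect every field of this row except IID.
            rest = ""
            for (i = 1; i <= NF; i++) if (i != col) rest = rest " " $i
        }
        NR == FNR { sire[$col] = rest; next }            # first file: store SIRE fields
        ($col in sire) { print $col sire[$col] rest }    # second file: emit matching rows
    ' "$1" "$2"
}

# Toy data (made up for illustration only).
demo_dir=$(mktemp -d)
printf 'IID sire\nA1 EUR\nA2 NA\n' > "$demo_dir/sire.txt"
printf 'IID PC1 PC2\nA2 0.10 -0.20\nA1 0.30 0.40\n' > "$demo_dir/pcs.txt"
merge_by_iid "$demo_dir/sire.txt" "$demo_dir/pcs.txt"
rm -rf "$demo_dir"
```

Rows are matched by IID rather than by position, so the two files do not need to list individuals in the same order.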

/********HARE Steps********/
Step 1: Run para1.R and set up a coarse grid for tuning parameters. There are two tuning parameters for SVM; they are selected through a (coarse) grid search, followed by a second grid search on a finer grid. The first step generates a list of the parameter combinations representing the coarse grid.
tmp/para1.csv: example input parameter list file, which is generated by script para1.R.
The range and step size of the grid are specified in para1.R.
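A parameter list of the kind Step 1 produces can be sketched as a nested loop over two value lists. This is not para1.R: the function name make_grid and the grid values are hypothetical, and the column names "cost" and "gamma" are an assumption (they are the usual e1071 svm() arguments for a radial kernel; this README does not name the two parameters).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a coarse-grid parameter list (not para1.R itself).

make_grid() {
    # $1 = space-separated values of the first parameter,
    # $2 = space-separated values of the second parameter.
    echo "cost,gamma"            # assumed column names
    for c in $1; do
        for g in $2; do
            echo "$c,$g"         # one row per parameter combination
        done
    done
}

# Illustrative 5 x 6 grid: 1 header line + 30 grid points.
make_grid "0.1 1 10 100 1000" "0.0001 0.001 0.01 0.1 1 10" | wc -l
```

Spacing the values by powers of ten is a common choice for a coarse search, with the finer second grid placed around the best coarse point.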

Step 2: Run the first round of tuning parameter selection on the coarse grid: train an SVM at each parameter combination using five-fold CV.
svm_fold.sh calls svm_fold.R, which trains an SVM at a specific set of parameters using one training-testing data split and outputs the testing accuracy. Individual output files are written to the folder tmp/svm1.
This step runs nfold * ngrid_points jobs in parallel (in our setup, 5 x (5 x 6) = 150 jobs).
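One way to map a job array index onto a (grid point, fold) pair is shown below. This is a sketch of the general idea only; the mapping actually used inside svm_fold.sh may differ, and the function name index_to_task is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decode a 1-based job index into a grid row and a fold.

index_to_task() {
    # $1 = 1-based job index, $2 = number of folds.
    # Prints "<grid_row> <fold>", both 1-based.
    echo "$(( ($1 - 1) / $2 + 1 )) $(( ($1 - 1) % $2 + 1 ))"
}

nfold=5
ngrid=30                                  # 5 x 6 grid points
echo "total jobs: $(( nfold * ngrid ))"   # 150
index_to_task 1 "$nfold"                  # 1 1
index_to_task 6 "$nfold"                  # 2 1
index_to_task 150 "$nfold"                # 30 5
```

With this layout, consecutive job indices cycle through the folds of one grid point before moving on to the next grid point.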

Step 3: Use the results of the first round of grid search to narrow the range for the second round.
svm1summ.sh: Aggregates the CV accuracy across all folds and compares the accuracy across grid points to narrow the search to a smaller region for the second, finer grid search. At the end, it sets up the input parameters for the second round of grid search in tmp/para2.csv (analogous to tmp/para1.csv).
This script needs to wait for all jobs in step 2 to finish.
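The aggregation in Step 3 amounts to averaging the per-fold accuracy at each grid point and then locating the best grid point. The sketch below shows that idea on toy numbers. It is not svm1summ.sh: the input layout "cost,gamma,fold,accuracy", the function names, and the accuracy values are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: average CV accuracy per grid point, then pick the best.

summarize_cv() {
    # Reads CSV rows "cost,gamma,fold,accuracy" (header on line 1) from stdin
    # and prints "cost,gamma,mean_accuracy" for each grid point.
    awk -F, 'NR > 1 { key = $1 "," $2; sum[key] += $4; n[key]++ }
             END { for (k in sum) printf "%s,%.4f\n", k, sum[k] / n[k] }'
}

best_point() {
    # Picks the row with the highest mean accuracy (third CSV field).
    LC_ALL=C sort -t, -k3,3nr | head -n 1
}

# Toy input (made-up accuracies, two folds per grid point).
printf '%s\n' \
    'cost,gamma,fold,accuracy' \
    '1,0.01,1,0.90' '1,0.01,2,0.92' \
    '10,0.01,1,0.95' '10,0.01,2,0.97' |
    summarize_cv | best_point             # prints 10,0.01,0.9600
```

The second-round grid would then be centered on the selected point with a smaller step size.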

Step 4: Run the second round of tuning parameter selection on the finer grid: train an SVM at each parameter combination using five-fold CV.
This step calls svm_fold.sh again, with different parameters.

Step 5: Analogous to Step 3, aggregate the results of the second round of grid search to find the optimal tuning parameters. Using these parameter values, HARE is assigned to all individuals.
This step needs to wait for all jobs in step 4 to finish.
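The requirement that a summary step waits for all per-fold jobs can be checked in several ways. The sketch below simply polls an output folder until the expected number of result files is present. The function name wait_for_outputs is hypothetical and this check is not part of the HARE scripts; scheduler-level job dependencies are an alternative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll a folder until the expected number of files exists.

wait_for_outputs() {
    # $1 = output directory, $2 = expected number of files,
    # $3 = seconds between checks, $4 = maximum number of checks.
    tries=0
    while [ "$tries" -lt "$4" ]; do
        count=$(find "$1" -type f | wc -l | tr -d ' ')
        [ "$count" -ge "$2" ] && return 0     # all outputs are present
        tries=$(( tries + 1 ))
        sleep "$3"
    done
    return 1                                  # gave up before all outputs appeared
}

# Toy demonstration with three placeholder files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/c.csv"
if wait_for_outputs "$demo_dir" 3 1 2; then
    echo "all outputs present"
fi
rm -rf "$demo_dir"
```

A file count only shows that each job has started writing its output, so a job that is still writing would pass this check.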

/*********Output***********/
The output directory "output/" for HARE includes 2 files. "HARE_output.txt" contains the HARE assignments and has 3 columns: "IID", "sire" and "hare". The second file, an R data file, includes 4 R objects: "data_hare" is the HARE assignments, "mod_svm" is the SVM model trained on individuals with SIRE, "P1P2Psire" is the probability ratios and L1 class, and "pred_svm" is the probability prediction for all individuals.

#-------------------------------------------------------------------------------
The directory "tmp" is a temporary folder that holds cross-validation (CV) information. The files "para1.csv" and "para2.csv" are the parameter files for the first and second rounds of CV. The files "summ1.csv" and "summ2.csv" are the CV accuracy summaries for the first and second rounds. The files "svm1raw.csv" and "svm2raw.csv" collect the prediction accuracies from the individual output files under "svm1/" and "svm2/".