cd ../..Assignment #2
+You will need PLINK2 installed and available in your PATH. Please follow the OS-specific setup guide in SETUP.md. The dataset for this assignment consists of the following binary PLINK files: gwa.A2.bed, gwa.A2.bim, gwa.A2.fam , available at the following Google Drive link: https://drive.google.com/drive/folders/1rHoy3z52Yyj985ukjjtLhfIBxchpyYtZ?usp=drive_link. Please download all three files and save them in 02_activities/data/.
Question 1: Data inspection
+Before you run any models, first get familiar with the dataset. You may find data.table::fread() in R helpful for reading .bim and .fam files.
-
+
- Read the
.famfile. How many samples does the dataset contain?
+
knitr::opts_knit$set(root.dir = normalizePath("../../"))library(data.table)
+library(qqman)
+For example usage please run: vignette('qqman')
+
+Citation appreciated but not required:
+Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
+
+fam <- data.table::fread("./02_activities/data/gwa.A2.fam")
+
+n_samples <- nrow(fam)
+n_samples[1] 4000
+-
+
- Read the
.bimfile. How many SNPs does the dataset contain?
+
bim <- data.table::fread("./02_activities/data/gwa.A2.bim")
+
+n_snps <- nrow(bim)
+n_snps[1] 306102
+-
+
- Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian? +
phenotype <- fam[[6]]
+
+hist(
+ phenotype,
+ main = "Histogram of Phenotype",
+ xlab = "Phenotype Value",
+
+ col = "lightblue",
+ border = "white"
+)
Question 2: Quality control (QC)
+Now we will perform QC using PLINK2 for the genotype files in gwa.A2.
-
+
- Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (
--geno) ≤ 0.01, individual missingness (--mind) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset asgwa.qc.A2.
+
+plink2 \
+ --bfile ./02_activities/data/gwa.A2 \
+ --maf 0.05 \
+ --geno 0.01 \
+ --mind 0.10 \
+ --hwe 0.00005 \
+ --make-bed \
+ --out ./02_activities/data/gwa.qc.A2PLINK v2.0.0-a.7 AVX2 (10 Jan 2026) cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.A2
+ --geno 0.01
+ --hwe 0.00005
+ --maf 0.05
+ --make-bed
+ --mind 0.10
+ --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 02 14:08:29 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.A2.fam.
+306102 variants loaded from ./02_activities/data/gwa.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating sample missingness rates... 0%21%42%64%85%done.
+0 samples removed due to missing genotype data (--mind).
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Calculating allele frequencies... 0%21%42%64%85%done.
+--geno: 196578 variants removed due to missing genotype data.
+--hwe: 6 variants removed due to Hardy-Weinberg exact test (founders only).
+8435 variants removed due to allele frequency threshold(s)
+(--maf/--max-maf/--mac/--max-mac).
+101083 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc.A2.bed ... 0%21%43%65%87%done.
+End time: Thu Apr 02 14:08:30 2026
+Question 4: principal component analysis (PCA)
+-
+
Using PLINK2, perform PCA on the unrelated, LD-pruned dataset to obtain principal components.
+(Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a
*.prune.infile to select a set of independent SNPs before performing PCA.)
+
+plink2 \
+ --bfile ./02_activities/data/gwa.qc_unrelated.A2 \
+ --extract ./02_activities/data/gwa.qc.A2.prune.in \
+ --make-bed \
+ --out ./02_activities/data/gwa.qc_unrelated_pruned.A2
+
+
+plink2 \
+ --bfile ./02_activities/data/gwa.qc_unrelated_pruned.A2 \
+ --pca 20 \
+ --out ./02_activities/data/gwa_unrel_PCA.A2PLINK v2.0.0-a.7 AVX2 (10 Jan 2026) cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_unrelated_pruned.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc_unrelated.A2
+ --extract ./02_activities/data/gwa.qc.A2.prune.in
+ --make-bed
+ --out ./02_activities/data/gwa.qc_unrelated_pruned.A2
+
+Start time: Thu Apr 02 14:08:32 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.bed ... 0%61%done.
+End time: Thu Apr 02 14:08:33 2026
+PLINK v2.0.0-a.7 AVX2 (10 Jan 2026) cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa_unrel_PCA.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc_unrelated_pruned.A2
+ --out ./02_activities/data/gwa_unrel_PCA.A2
+ --pca 20
+
+Start time: Thu Apr 02 14:08:33 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated_pruned.A2.fam.
+21914 variants loaded from ./02_activities/data/gwa.qc_unrelated_pruned.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%done.
+Constructing GRM: 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Correcting for missingness... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Extracting eigenvalues and eigenvectors... done.
+--pca: Eigenvectors written to ./02_activities/data/gwa_unrel_PCA.A2.eigenvec ,
+and eigenvalues written to ./02_activities/data/gwa_unrel_PCA.A2.eigenval .
+End time: Thu Apr 02 14:08:36 2026
+Question 5: GWAS analyses
+We now test for association between each SNP and the continuous phenotype, adjusting for population structure using the top 3 PCs.
+Assume that:
+-
+
You will run GWAS on the unrelated, QC’ed dataset
gwa.qc_unrelated.A2.
+You have a covariate file (e.g., the
*.eigenvecoutput from the PCA analysis) containing PC1–PC3 for each individual.
+
-
+
- Using PLINK2, run a linear regression GWAS using
gwa.qc_unrelated.A2as the input dataset. Adjust for PCs 1–3 as covariates.
+
+plink2 \
+ --bfile ./02_activities/data/gwa.qc_unrelated.A2 \
+ --linear \
+ --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec \
+ --covar-name PC1,PC2,PC3 \
+ --ci 0.95 \
+ --out ./02_activities/data/gwa.qc_assoc.A2PLINK v2.0.0-a.7 AVX2 (10 Jan 2026) cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_assoc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc_unrelated.A2
+ --ci 0.95
+ --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec
+ --covar-name PC1,PC2,PC3
+ --glm
+ --out ./02_activities/data/gwa.qc_assoc.A2
+
+Start time: Thu Apr 02 14:08:36 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+3 covariates loaded from ./02_activities/data/gwa_unrel_PCA.A2.eigenvec.
+Calculating allele frequencies... 0%64%done.
+--glm linear regression on phenotype 'PHENO1': 0%64%done.
+Results written to ./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear .
+End time: Thu Apr 02 14:08:38 2026
+-
+
- Create a Manhattan plot of the GWAS results. +
library(data.table)
+library(qqman)
+
+assoc <- fread("./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear")
+
+# Keep additive model only
+assoc_add <- assoc[TEST == "ADD"]
+
+# Rename columns for qqman
+setnames(assoc_add, c("#CHROM","POS","ID"), c("CHR","BP","SNP"))
+
+# Manhattan plot
+png("./02_activities/data/manhattan_plot_A2.png", width = 1200, height = 800, res = 150)
+manhattan(
+ assoc_add,
+
+chr = "CHR",
+ bp = "BP",
+ snp = "SNP",
+ p = "P",
+ suggestiveline = FALSE,
+ col = c("lightblue", "lightslateblue")
+)
+dev.off()png
+ 2
+-
+
- Create a QQ plot of the GWAS p-values. +
png("./02_activities/data/QQ_plot_A2.png", width = 1200, height = 800, res = 150)
+qq(assoc_add$P, main = "Q-Q Plot of GWAS P-values (Assignment 2)")
+dev.off()png
+ 2
+Criteria
+| Criteria | +Complete | +Incomplete | +
|---|---|---|
| Data inspection | +Correct sample and SNP counts; phenotype plot with a brief comment on Gaussianity. | +Counts or phenotype description/plot missing or incorrect. | +
| QC & LD pruning | +Correct PLINK2 QC command and thresholds. | +QC/pruning commands, thresholds, or output datasets missing or incorrect. | +
| Relatedness & PCA | +Correct use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs. | +Relatedness step, unrelated dataset, or PCA analysis missing or incorrect. | +
| GWAS & visualisation | +Linear regression GWAS with PCs as covariates; Manhattan and QQ plots produced. | +GWAS command, or Manhattan/QQ plots missing or clearly incorrect. | +
Submission Information
+🚨 Please review our Assignment Submission Guide 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
++
Submission Parameters
+-
+
Submission Due Date: 11:59 PM – 01/04/2026
+The branch name for your repo should be:
assignment-2
+What to submit for this assignment:
+-
+
- Populate this Quarto document (
assignment_2.qmd).
+ - Render the document with Quarto:
quarto render assignment_2.qmd.
+ - Submit both
assignment_2.qmdand the rendered HTML fileassignment_2.htmlin your pull request.
+
+- Populate this Quarto document (
What the pull request link should look like for this assignment:
+https://github.com/<your_github_username>/gen_data/pull/<pr_id>-
+
- Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily. +
+
+
Checklist:
+-
+
- Create a branch called
assignment-2.
+ - Ensure that your repository is public. +
- Review the PR description guidelines and adhere to them. +
- Verify that your link is accessible in a private browser window. +
- Confirm that both
assignment_2.qmdandassignment_2.htmlare included in the pull request.
+