diff --git a/02_activities/assignments/QQ_plot_A2.png b/02_activities/assignments/QQ_plot_A2.png new file mode 100644 index 0000000..b65d6ed Binary files /dev/null and b/02_activities/assignments/QQ_plot_A2.png differ diff --git a/02_activities/assignments/assignment_2.html b/02_activities/assignments/assignment_2.html new file mode 100644 index 0000000..876fb30 --- /dev/null +++ b/02_activities/assignments/assignment_2.html @@ -0,0 +1,1030 @@ + + + + + + + + + +Assignment #2 + + + + + + + + + + + + + + + + + + + + + +
+ +
+ +
+
+

Assignment #2

+
+ + + +
+ + + + +
+ + + +
+ + +

You will need PLINK2 installed and available in your PATH. Please follow the OS-specific setup guide in SETUP.md. The dataset for this assignment consists of the following binary PLINK files: gwa.A2.bed, gwa.A2.bim, gwa.A2.fam , available at the following Google Drive link: https://drive.google.com/drive/folders/1rHoy3z52Yyj985ukjjtLhfIBxchpyYtZ?usp=drive_link. Please download all three files and save them in 02_activities/data/.

+
+

Question 1: Data inspection

+

Before you run any models, first get familiar with the dataset. You may find data.table::fread() in R helpful for reading .bim and .fam files.

+
    +
  1. Read the .fam file. How many samples does the dataset contain?
  2. +
+
+
cd ../..
+
+
+
knitr::opts_knit$set(root.dir = normalizePath("../../"))
+
+
+
library(data.table)
+library(qqman)
+
+
+
+
+
For example usage please run: vignette('qqman')
+
+
+
+
+
+
Citation appreciated but not required:
+
+
+
Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.
+
+
+
+
+
fam <- data.table::fread("./02_activities/data/gwa.A2.fam")
+
+n_samples <- nrow(fam)
+n_samples
+
+
[1] 4000
+
+
+
    +
  1. Read the .bim file. How many SNPs does the dataset contain?
  2. +
+
+
bim <- data.table::fread("./02_activities/data/gwa.A2.bim")
+
+n_snps <- nrow(bim)
+n_snps
+
+
[1] 306102
+
+
+
    +
  1. Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian?
  2. +
+
+
phenotype <- fam[[6]]
+
+hist(
+  phenotype,
+  main = "Histogram of Phenotype",
+  xlab = "Phenotype Value",
+
+ col = "lightblue",
+  border = "white"
+)
+
+
+
+

+
+
+
+
+
+
+

Question 2: Quality control (QC)

+

Now we will perform QC using PLINK2 for the genotype files in gwa.A2.

+
    +
  1. Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (--geno) ≤ 0.01, individual missingness (--mind) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as gwa.qc.A2.
  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.A2 \
+  --maf 0.05 \
+  --geno 0.01 \
+  --mind 0.10 \
+  --hwe 0.00005 \
+  --make-bed \
+  --out ./02_activities/data/gwa.qc.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.A2
+  --geno 0.01
+  --hwe 0.00005
+  --maf 0.05
+  --make-bed
+  --mind 0.10
+  --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 02 14:08:29 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.A2.fam.
+306102 variants loaded from ./02_activities/data/gwa.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating sample missingness rates... 0%21%42%64%85%done.
+0 samples removed due to missing genotype data (--mind).
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Calculating allele frequencies... 0%21%42%64%85%done.
+--geno: 196578 variants removed due to missing genotype data.
+--hwe: 6 variants removed due to Hardy-Weinberg exact test (founders only).
+8435 variants removed due to allele frequency threshold(s)
+(--maf/--max-maf/--mac/--max-mac).
+101083 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc.A2.bed ... 0%21%43%65%87%done.
+End time: Thu Apr 02 14:08:30 2026
+
+
+
+
+

Question 3: Relatedness

+

In this question, you will use PLINK2’s built-in KING-robust kinship (--king-cutoff) detect and remove related individuals.

+
    +
  1. Perform LD pruning on gwa.qc.A2 using PLINK2 with the following parameters: --indep-pairwise 500 50 0.05, and then generate a new dataset containing only the pruned SNPs.
  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.qc.A2 \
+  --indep-pairwise 500 50 0.05 \
+  --out ./02_activities/data/gwa.qc.A2
+
+  plink2 \
+  --bfile ./02_activities/data/gwa.qc.A2 \
+  --extract ./02_activities/data/gwa.qc.A2.prune.in \
+  --make-bed \
+  --out ./02_activities/data/gwa.qc_pruned.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc.A2
+  --indep-pairwise 500 50 0.05
+  --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 02 14:08:30 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%64%done.
+--indep-pairwise (13 compute threads): 0%50%79169/101083 variants removed.
+Writing...
+Variant lists written to ./02_activities/data/gwa.qc.A2.prune.in and
+./02_activities/data/gwa.qc.A2.prune.out .
+End time: Thu Apr 02 14:08:30 2026
+PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_pruned.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc.A2
+  --extract ./02_activities/data/gwa.qc.A2.prune.in
+  --make-bed
+  --out ./02_activities/data/gwa.qc_pruned.A2
+
+Start time: Thu Apr 02 14:08:30 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc_pruned.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc_pruned.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc_pruned.A2.bed ... 0%61%done.
+End time: Thu Apr 02 14:08:31 2026
+
+
+
    +
  1. Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884).
  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.qc_pruned.A2 \
+  --king-cutoff 0.0884 \
+  --out ./02_activities/data/gwa_qc.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa_qc.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc_pruned.A2
+  --king-cutoff 0.0884
+  --out ./02_activities/data/gwa_qc.A2
+
+Start time: Thu Apr 02 14:08:31 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_pruned.A2.fam.
+21914 variants loaded from ./02_activities/data/gwa.qc_pruned.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--king-cutoff pass 1/1: Scanning for rare variants... 0%done.
+0 variants handled by initial scan (21914 remaining).
+
+--king-cutoff pass 1/1: 0 variants complete.
+--king-cutoff pass 1/1: 1024 variants complete.
+--king-cutoff pass 1/1: 2048 variants complete.
+--king-cutoff pass 1/1: 3072 variants complete.
+--king-cutoff pass 1/1: 4096 variants complete.
+--king-cutoff pass 1/1: 5120 variants complete.
+--king-cutoff pass 1/1: 6144 variants complete.
+--king-cutoff pass 1/1: 7168 variants complete.
+--king-cutoff pass 1/1: 8192 variants complete.
+--king-cutoff pass 1/1: 9216 variants complete.
+--king-cutoff pass 1/1: 10240 variants complete.
+--king-cutoff pass 1/1: 11264 variants complete.
+--king-cutoff pass 1/1: 12288 variants complete.
+--king-cutoff pass 1/1: 13312 variants complete.
+--king-cutoff pass 1/1: 14336 variants complete.
+--king-cutoff pass 1/1: 15360 variants complete.
+--king-cutoff pass 1/1: 16384 variants complete.
+--king-cutoff pass 1/1: 17408 variants complete.
+--king-cutoff pass 1/1: 18432 variants complete.
+--king-cutoff pass 1/1: 19456 variants complete.
+--king-cutoff pass 1/1: 20480 variants complete.
+--king-cutoff pass 1/1: 21504 variants complete.
+--king-cutoff pass 1/1: Condensing...                 done.
+--king-cutoff: 21914 variants processed.
+--king-cutoff: Excluded sample IDs written to
+./02_activities/data/gwa_qc.A2.king.cutoff.out.id , and 4000 remaining sample
+IDs written to ./02_activities/data/gwa_qc.A2.king.cutoff.in.id .
+End time: Thu Apr 02 14:08:31 2026
+
+
+
    +
  1. Using the unrelated individual list from part (ii), create a dataset containing only unrelated individuals and name it gwa.qc_unrelated.A2.
  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.qc.A2 \
+  --keep ./02_activities/data/gwa_qc.A2.king.cutoff.in.id \
+  --make-bed \
+  --out ./02_activities/data/gwa.qc_unrelated.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_unrelated.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc.A2
+  --keep ./02_activities/data/gwa_qc.A2.king.cutoff.in.id
+  --make-bed
+  --out ./02_activities/data/gwa.qc_unrelated.A2
+
+Start time: Thu Apr 02 14:08:32 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--keep: 4000 samples remaining.
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Writing ./02_activities/data/gwa.qc_unrelated.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc_unrelated.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc_unrelated.A2.bed ... 0%64%done.
+End time: Thu Apr 02 14:08:32 2026
+
+
+
+
+

Question 4: principal component analysis (PCA)

+
    +
  1. Using PLINK2, perform PCA on the unrelated, LD-pruned dataset to obtain principal components.

    +

    (Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a *.prune.in file to select a set of independent SNPs before performing PCA.)

  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.qc_unrelated.A2 \
+  --extract ./02_activities/data/gwa.qc.A2.prune.in \
+  --make-bed \
+  --out ./02_activities/data/gwa.qc_unrelated_pruned.A2
+
+
+plink2 \
+  --bfile ./02_activities/data/gwa.qc_unrelated_pruned.A2 \
+  --pca 20 \
+  --out ./02_activities/data/gwa_unrel_PCA.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_unrelated_pruned.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc_unrelated.A2
+  --extract ./02_activities/data/gwa.qc.A2.prune.in
+  --make-bed
+  --out ./02_activities/data/gwa.qc_unrelated_pruned.A2
+
+Start time: Thu Apr 02 14:08:32 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc_unrelated_pruned.A2.bed ... 0%61%done.
+End time: Thu Apr 02 14:08:33 2026
+PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa_unrel_PCA.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc_unrelated_pruned.A2
+  --out ./02_activities/data/gwa_unrel_PCA.A2
+  --pca 20
+
+Start time: Thu Apr 02 14:08:33 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated_pruned.A2.fam.
+21914 variants loaded from ./02_activities/data/gwa.qc_unrelated_pruned.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating allele frequencies... 0%done.
+Constructing GRM: 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Correcting for missingness... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
+Extracting eigenvalues and eigenvectors... done.
+--pca: Eigenvectors written to ./02_activities/data/gwa_unrel_PCA.A2.eigenvec ,
+and eigenvalues written to ./02_activities/data/gwa_unrel_PCA.A2.eigenval .
+End time: Thu Apr 02 14:08:36 2026
+
+
+
+
+

Question 5: GWAS analyses

+

We now test for association between each SNP and the continuous phenotype, adjusting for population structure using the top 3 PCs.

+

Assume that:

+
    +
  • You will run GWAS on the unrelated, QC’ed dataset gwa.qc_unrelated.A2.

  • +
  • You have a covariate file (e.g., the *.eigenvec output from the PCA analysis) containing PC1–PC3 for each individual.

  • +
+
    +
  1. Using PLINK2, run a linear regression GWAS using gwa.qc_unrelated.A2 as the input dataset. Adjust for PCs 1–3 as covariates.
  2. +
+
+

+plink2 \
+  --bfile ./02_activities/data/gwa.qc_unrelated.A2 \
+  --linear \
+  --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec \
+  --covar-name PC1,PC2,PC3 \
+  --ci 0.95 \
+  --out ./02_activities/data/gwa.qc_assoc.A2
+
+
PLINK v2.0.0-a.7 AVX2 (10 Jan 2026)                 cog-genomics.org/plink/2.0/
+(C) 2005-2026 Shaun Purcell, Christopher Chang    GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc_assoc.A2.log.
+Options in effect:
+  --bfile ./02_activities/data/gwa.qc_unrelated.A2
+  --ci 0.95
+  --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec
+  --covar-name PC1,PC2,PC3
+  --glm
+  --out ./02_activities/data/gwa.qc_assoc.A2
+
+Start time: Thu Apr 02 14:08:36 2026
+16072 MiB RAM detected; reserving 8036 MiB for main workspace.
+Using up to 28 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc_unrelated.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc_unrelated.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+3 covariates loaded from ./02_activities/data/gwa_unrel_PCA.A2.eigenvec.
+Calculating allele frequencies... 0%64%done.
+--glm linear regression on phenotype 'PHENO1': 0%64%done.
+Results written to ./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear .
+End time: Thu Apr 02 14:08:38 2026
+
+
+
    +
  1. Create a Manhattan plot of the GWAS results.
  2. +
+
+
library(data.table)
+library(qqman)
+
+assoc <- fread("./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear")
+
+# Keep additive model only
+assoc_add <- assoc[TEST == "ADD"]
+
+# Rename columns for qqman
+setnames(assoc_add, c("#CHROM","POS","ID"), c("CHR","BP","SNP"))
+
+# Manhattan plot
+png("./02_activities/data/manhattan_plot_A2.png", width = 1200, height = 800, res = 150)
+manhattan(
+  assoc_add,
+
+chr = "CHR",
+  bp  = "BP",
+  snp = "SNP",
+  p   = "P",
+  suggestiveline = FALSE,
+  col = c("lightblue", "lightslateblue")
+)
+dev.off()
+
+
png 
+  2 
+
+
+
    +
  1. Create a QQ plot of the GWAS p-values.
  2. +
+
+
png("./02_activities/data/QQ_plot_A2.png", width = 1200, height = 800, res = 150)
+qq(assoc_add$P, main = "Q-Q Plot of GWAS P-values (Assignment 2)")
+dev.off()
+
+
png 
+  2 
+
+
+
+
+

Criteria

+ +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CriteriaCompleteIncomplete
Data inspectionCorrect sample and SNP counts; phenotype plot with a brief comment on Gaussianity.Counts or phenotype description/plot missing or incorrect.
QC & LD pruningCorrect PLINK2 QC command and thresholds.QC/pruning commands, thresholds, or output datasets missing or incorrect.
Relatedness & PCACorrect use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs.Relatedness step, unrelated dataset, or PCA analysis missing or incorrect.
GWAS & visualisationLinear regression GWAS with PCs as covariates; Manhattan and QQ plots produced.GWAS command, or Manhattan/QQ plots missing or clearly incorrect.
+
+
+

Submission Information

+

🚨 Please review our Assignment Submission Guide 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

+
+
+

Submission Parameters

+
    +
  • Submission Due Date: 11:59 PM – 01/04/2026

  • +
  • The branch name for your repo should be: assignment-2

  • +
  • What to submit for this assignment:

    +
      +
    • Populate this Quarto document (assignment_2.qmd).
    • +
    • Render the document with Quarto: quarto render assignment_2.qmd.
    • +
    • Submit both assignment_2.qmd and the rendered HTML file assignment_2.html in your pull request.
    • +
  • +
  • What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>

    +
      +
    • Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.
    • +
  • +
+
+

Checklist:

+
    +
  • Create a branch called assignment-2.
  • +
  • Ensure that your repository is public.
  • +
  • Review the PR description guidelines and adhere to them.
  • +
  • Verify that your link is accessible in a private browser window.
  • +
  • Confirm that both assignment_2.qmd and assignment_2.html are included in the pull request.
  • +
+
+
+ +
+ + +
+ + + + + \ No newline at end of file diff --git a/02_activities/assignments/assignment_2.qmd b/02_activities/assignments/assignment_2.qmd index fd93722..f327ee3 100644 --- a/02_activities/assignments/assignment_2.qmd +++ b/02_activities/assignments/assignment_2.qmd @@ -1,138 +1,246 @@ ---- -title: "Assignment #2" -format: html ---- - -You will need **PLINK2** installed and available in your PATH. Please follow the OS-specific setup guide in [`SETUP.md`](../../SETUP.md). The dataset for this assignment consists of the following binary PLINK files: `gwa.A2.bed`, `gwa.A2.bim`, `gwa.A2.fam` , available at the following Google Drive link: . Please download all three files and save them in `02_activities/data/`. - -#### Question 1: Data inspection - -Before you run any models, first get familiar with the dataset. You may find `data.table::fread()` in R helpful for reading `.bim` and `.fam` files. - -(i) Read the `.fam` file. How many samples does the dataset contain? - -```{r} -# Your code here... -``` - -(ii) Read the `.bim` file. How many SNPs does the dataset contain? - -```{r} -# Your code here... -``` - -(iii) Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian? - -```{r} -# Your code here... -``` - -#### Question 2: Quality control (QC) - -Now we will perform QC using PLINK2 for the genotype files in `gwa.A2`. - -(i) Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (`--geno`) ≤ 0.01, individual missingness (`--mind`) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as `gwa.qc.A2`. - -```{bash} -# Your code here... -``` - -#### Question 3: Relatedness - -In this question, you will use **PLINK2’s built-in KING-robust kinship** (`--king-cutoff`) detect and remove related individuals. - -i) Perform LD pruning on `gwa.qc.A2` using PLINK2 with the following parameters: `--indep-pairwise 500 50 0.05`, and then generate a new dataset containing only the pruned SNPs. - -```{bash} -# Your code here... -``` - -(ii) Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884). - -```{bash} -# Your code here... -``` - -(iii) Using the unrelated individual list from part (ii), create a dataset containing only unrelated individuals and name it `gwa.qc_unrelated.A2`. - -```{bash} -# Your code here... -``` - -#### Question 4: principal component analysis (PCA) - -(i) Using PLINK2, perform PCA on the unrelated, LD-pruned dataset to obtain principal components. - - (Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a `*.prune.in` file to select a set of independent SNPs before performing PCA.) - -```{bash} -# Your code here... -``` - -#### Question 5: GWAS analyses - -We now test for association between each SNP and the continuous phenotype, adjusting for population structure using the top 3 PCs. - -Assume that: - -- You will run GWAS on the unrelated, QC’ed dataset `gwa.qc_unrelated.A2`. - -- You have a covariate file (e.g., the `*.eigenvec` output from the PCA analysis) containing PC1–PC3 for each individual. - -(i) Using PLINK2, run a linear regression GWAS using `gwa.qc_unrelated.A2` as the input dataset. Adjust for PCs 1–3 as covariates. - -```{bash} -# Your code here... -``` - -(ii) Create a Manhattan plot of the GWAS results. - -```{r} -# Your code here... -``` - -(iii) Create a QQ plot of the GWAS p-values. - -```{r} -# Your code here... -``` - -### Criteria - -| Criteria | Complete | Incomplete | -|:--------------|:----------------------------------|:----------------------| -| **Data inspection** | Correct sample and SNP counts; phenotype plot with a brief comment on Gaussianity. | Counts or phenotype description/plot missing or incorrect. | -| **QC & LD pruning** | Correct PLINK2 QC command and thresholds. | QC/pruning commands, thresholds, or output datasets missing or incorrect. | -| **Relatedness & PCA** | Correct use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs. | Relatedness step, unrelated dataset, or PCA analysis missing or incorrect. | -| **GWAS & visualisation** | Linear regression GWAS with PCs as covariates; Manhattan and QQ plots produced. | GWAS command, or Manhattan/QQ plots missing or clearly incorrect. | - -### Submission Information - -🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly. - ------------------------------------------------------------------------- - -#### Submission Parameters - -- Submission Due Date: 11:59 PM – 01/04/2026 - -- The branch name for your repo should be: `assignment-2` - -- What to submit for this assignment: - - Populate this Quarto document (`assignment_2.qmd`). - - Render the document with Quarto: `quarto render assignment_2.qmd`. - - Submit both `assignment_2.qmd` and the rendered HTML file `assignment_2.html` in your pull request. - -- What the pull request link should look like for this assignment: `https://github.com//gen_data/pull/` - - - Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily. - ------------------------------------------------------------------------- - -Checklist: - -- Create a branch called `assignment-2`. -- Ensure that your repository is public. -- Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them. -- Verify that your link is accessible in a private browser window. -- Confirm that both `assignment_2.qmd` and `assignment_2.html` are included in the pull request. +--- +title: "Assignment #2" +format: html +--- + +You will need **PLINK2** installed and available in your PATH. Please follow the OS-specific setup guide in [`SETUP.md`](../../SETUP.md). The dataset for this assignment consists of the following binary PLINK files: `gwa.A2.bed`, `gwa.A2.bim`, `gwa.A2.fam` , available at the following Google Drive link: . Please download all three files and save them in `02_activities/data/`. + +#### Question 1: Data inspection + +Before you run any models, first get familiar with the dataset. You may find `data.table::fread()` in R helpful for reading `.bim` and `.fam` files. + +(i) Read the `.fam` file. How many samples does the dataset contain? +```{bash} +cd ../.. +``` +```{r} +knitr::opts_knit$set(root.dir = normalizePath("../../")) +``` +```{r} +library(data.table) +library(qqman) +fam <- data.table::fread("./02_activities/data/gwa.A2.fam") + +n_samples <- nrow(fam) +n_samples + +``` + +(ii) Read the `.bim` file. How many SNPs does the dataset contain? + +```{r} +bim <- data.table::fread("./02_activities/data/gwa.A2.bim") + +n_snps <- nrow(bim) +n_snps + +``` + +(iii) Make a histogram (or density plot) of the phenotype. Does it look roughly Gaussian? + +```{r} +phenotype <- fam[[6]] + +hist( + phenotype, + main = "Histogram of Phenotype", + xlab = "Phenotype Value", + + col = "lightblue", + border = "white" +) + + +``` + +#### Question 2: Quality control (QC) + +Now we will perform QC using PLINK2 for the genotype files in `gwa.A2`. + +(i) Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (`--geno`) ≤ 0.01, individual missingness (`--mind`) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as `gwa.qc.A2`. + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.A2 \ + --maf 0.05 \ + --geno 0.01 \ + --mind 0.10 \ + --hwe 0.00005 \ + --make-bed \ + --out ./02_activities/data/gwa.qc.A2 + +``` + +#### Question 3: Relatedness + +In this question, you will use **PLINK2’s built-in KING-robust kinship** (`--king-cutoff`) detect and remove related individuals. + +i) Perform LD pruning on `gwa.qc.A2` using PLINK2 with the following parameters: `--indep-pairwise 500 50 0.05`, and then generate a new dataset containing only the pruned SNPs. + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.qc.A2 \ + --indep-pairwise 500 50 0.05 \ + --out ./02_activities/data/gwa.qc.A2 + + plink2 \ + --bfile ./02_activities/data/gwa.qc.A2 \ + --extract ./02_activities/data/gwa.qc.A2.prune.in \ + --make-bed \ + --out ./02_activities/data/gwa.qc_pruned.A2 + +``` + +(ii) Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884). + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.qc_pruned.A2 \ + --king-cutoff 0.0884 \ + --out ./02_activities/data/gwa_qc.A2 + + +``` + +(iii) Using the unrelated individual list from part (ii), create a dataset containing only unrelated individuals and name it `gwa.qc_unrelated.A2`. + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.qc.A2 \ + --keep ./02_activities/data/gwa_qc.A2.king.cutoff.in.id \ + --make-bed \ + --out ./02_activities/data/gwa.qc_unrelated.A2 + +``` + +#### Question 4: principal component analysis (PCA) + +(i) Using PLINK2, perform PCA on the unrelated, LD-pruned dataset to obtain principal components. + + (Hint: Refer to the PCA section in the GWAS Tutorial II. You should use a `*.prune.in` file to select a set of independent SNPs before performing PCA.) + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.qc_unrelated.A2 \ + --extract ./02_activities/data/gwa.qc.A2.prune.in \ + --make-bed \ + --out ./02_activities/data/gwa.qc_unrelated_pruned.A2 + + +plink2 \ + --bfile ./02_activities/data/gwa.qc_unrelated_pruned.A2 \ + --pca 20 \ + --out ./02_activities/data/gwa_unrel_PCA.A2 + + +``` + +#### Question 5: GWAS analyses + +We now test for association between each SNP and the continuous phenotype, adjusting for population structure using the top 3 PCs. + +Assume that: + +- You will run GWAS on the unrelated, QC’ed dataset `gwa.qc_unrelated.A2`. + +- You have a covariate file (e.g., the `*.eigenvec` output from the PCA analysis) containing PC1–PC3 for each individual. + +(i) Using PLINK2, run a linear regression GWAS using `gwa.qc_unrelated.A2` as the input dataset. Adjust for PCs 1–3 as covariates. + +```{bash} + +plink2 \ + --bfile ./02_activities/data/gwa.qc_unrelated.A2 \ + --linear \ + --covar ./02_activities/data/gwa_unrel_PCA.A2.eigenvec \ + --covar-name PC1,PC2,PC3 \ + --ci 0.95 \ + --out ./02_activities/data/gwa.qc_assoc.A2 + +``` + +(ii) Create a Manhattan plot of the GWAS results. + +```{r} + +library(data.table) +library(qqman) + +assoc <- fread("./02_activities/data/gwa.qc_assoc.A2.PHENO1.glm.linear") + +# Keep additive model only +assoc_add <- assoc[TEST == "ADD"] + +# Rename columns for qqman +setnames(assoc_add, c("#CHROM","POS","ID"), c("CHR","BP","SNP")) + +# Manhattan plot +png("./02_activities/data/manhattan_plot_A2.png", width = 1200, height = 800, res = 150) +manhattan( + assoc_add, + +chr = "CHR", + bp = "BP", + snp = "SNP", + p = "P", + suggestiveline = FALSE, + col = c("lightblue", "lightslateblue") +) +dev.off() + +``` + +(iii) Create a QQ plot of the GWAS p-values. + +```{r} + +png("./02_activities/data/QQ_plot_A2.png", width = 1200, height = 800, res = 150) +qq(assoc_add$P, main = "Q-Q Plot of GWAS P-values (Assignment 2)") +dev.off() + +``` + +### Criteria + +| Criteria | Complete | Incomplete | +|:--------------|:----------------------------------|:----------------------| +| **Data inspection** | Correct sample and SNP counts; phenotype plot with a brief comment on Gaussianity. | Counts or phenotype description/plot missing or incorrect. | +| **QC & LD pruning** | Correct PLINK2 QC command and thresholds. | QC/pruning commands, thresholds, or output datasets missing or incorrect. | +| **Relatedness & PCA** | Correct use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs. | Relatedness step, unrelated dataset, or PCA analysis missing or incorrect. | +| **GWAS & visualisation** | Linear regression GWAS with PCs as covariates; Manhattan and QQ plots produced. | GWAS command, or Manhattan/QQ plots missing or clearly incorrect. | + +### Submission Information + +🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly. + +------------------------------------------------------------------------ + +#### Submission Parameters + +- Submission Due Date: 11:59 PM – 01/04/2026 + +- The branch name for your repo should be: `assignment-2` + +- What to submit for this assignment: + - Populate this Quarto document (`assignment_2.qmd`). + - Render the document with Quarto: `quarto render assignment_2.qmd`. + - Submit both `assignment_2.qmd` and the rendered HTML file `assignment_2.html` in your pull request. + +- What the pull request link should look like for this assignment: `https://github.com//gen_data/pull/` + + - Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily. + +------------------------------------------------------------------------ + +Checklist: + +- Create a branch called `assignment-2`. +- Ensure that your repository is public. +- Review [the PR description guidelines](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md#guidelines-for-pull-request-descriptions) and adhere to them. +- Verify that your link is accessible in a private browser window. +- Confirm that both `assignment_2.qmd` and `assignment_2.html` are included in the pull request. diff --git a/02_activities/assignments/manhattan_plot_A2.png b/02_activities/assignments/manhattan_plot_A2.png new file mode 100644 index 0000000..c011183 Binary files /dev/null and b/02_activities/assignments/manhattan_plot_A2.png differ