You will need PLINK2 installed and available on your PATH. Follow the OS-specific setup guide in SETUP.md. The dataset for this assignment consists of three binary PLINK files (gwa.A2.bed, gwa.A2.bim, and gwa.A2.fam), available at the following Google Drive link: https://drive.google.com/drive/folders/1rHoy3z52Yyj985ukjjtLhfIBxchpyYtZ?usp=drive_link. Download all three files and save them in 02_activities/data/.
+
+
Question 1: Data inspection
+
Before you run any models, first get familiar with the dataset. You may find data.table::fread() in R helpful for reading .bim and .fam files.
+
+
Read the .fam file. How many samples does the dataset contain?
+
+
+
head ./02_activities/data/gwa.A2.fam
+wc -l ./02_activities/data/gwa.A2.fam
Read the .bim file. How many SNPs does the dataset contain?
+
+
+
head ./02_activities/data/gwa.A2.bim
+wc -l ./02_activities/data/gwa.A2.bim
+
+
1 rs3934834 0 995669 T C
+1 rs3737728 0 1011278 A G
+1 rs6687776 0 1020428 T C
+1 rs9651273 0 1021403 A G
+1 rs4970405 0 1038818 G A
+1 rs12726255 0 1039813 G A
+1 rs2298217 0 1054842 T C
+1 rs4970357 0 1066927 C A
+1 rs4970362 0 1084601 A G
+1 rs9660710 0 1089205 A C
+ 306102 ./02_activities/data/gwa.A2.bim
+
+
+
There are 306102 SNPs in the dataset.
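Because a .fam file holds one line per sample and a .bim file one line per variant, both counts reduce to line counts. The sketch below illustrates this on two tiny made-up files (toy.fam and toy.bim are hypothetical stand-ins, not the assignment data); redirecting the file into wc prints the bare number without the filename.

```shell
# Hypothetical miniature files standing in for gwa.A2.fam and gwa.A2.bim.
tmp_dir=$(mktemp -d)
printf '%s\n' \
  'FAM1 IND1 0 0 1 0.52' \
  'FAM2 IND2 0 0 2 -1.10' \
  'FAM3 IND3 0 0 2 0.07' > "${tmp_dir}/toy.fam"
printf '%s\n' \
  '1 rs3934834 0 995669 T C' \
  '1 rs3737728 0 1011278 A G' > "${tmp_dir}/toy.bim"

# One line per sample (.fam) and one line per variant (.bim).
# tr strips the leading padding that some wc implementations add.
n_samples=$(wc -l < "${tmp_dir}/toy.fam" | tr -d ' ')
n_snps=$(wc -l < "${tmp_dir}/toy.bim" | tr -d ' ')

echo "samples: ${n_samples}, SNPs: ${n_snps}"
```

Pointing the same two wc calls at the real files in ./02_activities/data/ should give the counts asked for in this question.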
+
+
+
Question 2: Quality control (QC)
+
Now we will perform QC using PLINK2 for the genotype files in gwa.A2.
+
+
Using PLINK2 from the command line (bash), perform basic QC with the following filters: MAF ≥ 0.05, SNP missingness (--geno) ≤ 0.01, individual missingness (--mind) ≤ 0.10, and HWE p-value ≥ 0.00005, and output the QC’ed dataset as gwa.qc.A2.
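A minimal sketch of a command consistent with these filters and with the options recorded in the log that follows. The run_qc wrapper and the existence guard are illustrative assumptions added here so the snippet is safe to paste; only the plink2 flags themselves come from the assignment.

```shell
data_dir="./02_activities/data"

# Wrap the QC call so all four thresholds sit in one auditable place.
run_qc() {
  plink2 \
    --bfile "${data_dir}/gwa.A2" \
    --maf 0.05 \
    --geno 0.01 \
    --mind 0.10 \
    --hwe 0.00005 \
    --make-bed \
    --out "${data_dir}/gwa.qc.A2"
}

# Only run when PLINK2 and the input files are actually present.
if command -v plink2 >/dev/null 2>&1 && [ -f "${data_dir}/gwa.A2.bed" ]; then
  run_qc
else
  echo "plink2 or ${data_dir}/gwa.A2.bed not found; skipping QC run"
fi
```

Note the direction of each filter: --maf and --hwe remove variants below the threshold, while --geno and --mind remove variants and samples whose missingness is above it.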
PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.A2
+ --geno 0.01
+ --hwe 0.00005
+ --maf 0.05
+ --make-bed
+ --mind 0.10
+ --out ./02_activities/data/gwa.qc.A2
+
+Start time: Thu Apr 2 01:20:35 2026
+49152 MiB RAM detected; reserving 24576 MiB for main workspace.
+Using up to 14 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.A2.fam.
+306102 variants loaded from ./02_activities/data/gwa.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+Calculating sample missingness rates... 0%21%42%64%85%done.
+0 samples removed due to missing genotype data (--mind).
+4000 samples (2000 females, 2000 males; 4000 founders) remaining after main
+filters.
+4000 quantitative phenotype values remaining after main filters.
+Calculating allele frequencies... 0%21%42%64%85%done.
+--geno: 196578 variants removed due to missing genotype data.
+--hwe: 6 variants removed due to Hardy-Weinberg exact test (founders only).
+8435 variants removed due to allele frequency threshold(s)
+(--maf/--max-maf/--mac/--max-mac).
+101083 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc.A2.fam ... done.
+Writing ./02_activities/data/gwa.qc.A2.bim ... done.
+Writing ./02_activities/data/gwa.qc.A2.bed ... 0%21%43%65%87%done.
+End time: Thu Apr 2 01:20:35 2026
+
+
+
+
+
Question 3: Relatedness
+
In this question, you will use PLINK2’s built-in KING-robust kinship estimator (--king-cutoff) to detect and remove related individuals.
+
+
Perform LD pruning on gwa.qc.A2 using PLINK2 with the following parameters: --indep-pairwise 500 50 0.05, and then generate a new dataset containing only the pruned SNPs.
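LD pruning in PLINK2 is a two-step process: one call selects the variants to keep, and a second call writes a dataset restricted to them. The sketch below is one way to express that, assuming the first step uses the output prefix gwa.qc.A2 (which matches the .prune.in path shown in the log); the function names and the guard are illustrative only.

```shell
data_dir="./02_activities/data"

# Step 1: window of 500 variants, step of 50 variants, r^2 threshold 0.05.
# Writes <out>.prune.in (variants kept) and <out>.prune.out (variants removed).
ld_prune() {
  plink2 \
    --bfile "${data_dir}/gwa.qc.A2" \
    --indep-pairwise 500 50 0.05 \
    --out "${data_dir}/gwa.qc.A2"
}

# Step 2: write a new binary dataset containing only the .prune.in variants.
extract_pruned() {
  plink2 \
    --bfile "${data_dir}/gwa.qc.A2" \
    --extract "${data_dir}/gwa.qc.A2.prune.in" \
    --make-bed \
    --out "${data_dir}/gwa.qc.A2.pruned"
}

# Only run when PLINK2 and the QC'ed dataset are actually present.
if command -v plink2 >/dev/null 2>&1 && [ -f "${data_dir}/gwa.qc.A2.bed" ]; then
  ld_prune && extract_pruned
else
  echo "plink2 or ${data_dir}/gwa.qc.A2.bed not found; skipping LD pruning"
fi
```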
PLINK v2.00a5.12 M1 (25 Jun 2024) www.cog-genomics.org/plink/2.0/
+(C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
+Logging to ./02_activities/data/gwa.qc.A2.pruned.log.
+Options in effect:
+ --bfile ./02_activities/data/gwa.qc.A2
+ --extract ./02_activities/data/gwa.qc.A2.prune.in
+ --make-bed
+ --out ./02_activities/data/gwa.qc.A2.pruned
+
+Start time: Thu Apr 2 01:20:35 2026
+49152 MiB RAM detected; reserving 24576 MiB for main workspace.
+Using up to 14 threads (change this with --threads).
+4000 samples (2000 females, 2000 males; 4000 founders) loaded from
+./02_activities/data/gwa.qc.A2.fam.
+101083 variants loaded from ./02_activities/data/gwa.qc.A2.bim.
+1 quantitative phenotype loaded (4000 values).
+--extract: 21914 variants remaining.
+21914 variants remaining after main filters.
+Writing ./02_activities/data/gwa.qc.A2.pruned.fam ... done.
+Writing ./02_activities/data/gwa.qc.A2.pruned.bim ... done.
+Writing ./02_activities/data/gwa.qc.A2.pruned.bed ... 0%61%done.
+End time: Thu Apr 2 01:20:35 2026
+
+
+
+
Use PLINK2 on the LD-pruned dataset to identify a set of unrelated individuals up to (approximately) 2nd-degree relatives (use a kinship cutoff of 0.0884).
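As a hedged sketch of this step: --king-cutoff writes two sample lists next to the output prefix, <out>.king.cutoff.in.id (samples to keep) and <out>.king.cutoff.out.id (samples to drop), and a follow-up --keep call can then materialise the unrelated subset. The prefix gwa.qc.A2.pruned.unrelated and the function names are hypothetical; the assignment does not name the resulting dataset.

```shell
data_dir="./02_activities/data"
# Hypothetical output prefix for the unrelated subset (not named in the assignment).
unrel_prefix="${data_dir}/gwa.qc.A2.pruned.unrelated"

# Step 1: flag one member of each pair with kinship above 0.0884.
find_unrelated() {
  plink2 \
    --bfile "${data_dir}/gwa.qc.A2.pruned" \
    --king-cutoff 0.0884 \
    --out "${data_dir}/gwa.qc.A2.pruned"
}

# Step 2: keep only the samples listed in the .king.cutoff.in.id file.
keep_unrelated() {
  plink2 \
    --bfile "${data_dir}/gwa.qc.A2.pruned" \
    --keep "${data_dir}/gwa.qc.A2.pruned.king.cutoff.in.id" \
    --make-bed \
    --out "${unrel_prefix}"
}

# Only run when PLINK2 and the pruned dataset are actually present.
if command -v plink2 >/dev/null 2>&1 && [ -f "${data_dir}/gwa.qc.A2.pruned.bed" ]; then
  find_unrelated && keep_unrelated
else
  echo "plink2 or ${data_dir}/gwa.qc.A2.pruned.bed not found; skipping relatedness step"
fi
```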
#CHROM POS ID REF ALT PROVISIONAL_REF? A1 OMITTED
+ <int> <int> <char> <char> <char> <char> <char> <char>
+1: 1 1011278 rs3737728 G A Y A G
+2: 1 1011278 rs3737728 G A Y A G
+3: 1 1011278 rs3737728 G A Y A G
+4: 1 1011278 rs3737728 G A Y A G
+5: 1 1109721 rs1320565 C T Y T C
+6: 1 1109721 rs1320565 C T Y T C
+ A1_FREQ TEST OBS_CT BETA SE L95 U95 T_STAT
+ <num> <char> <int> <num> <num> <num> <num> <num>
+1: 0.3386490 ADD 3982 0.0182086 0.0240795 -0.0289864 0.0654035 0.756187
+2: 0.3386490 PC1 3982 -0.3779240 1.0015500 -2.3409200 1.5850800 -0.377340
+3: 0.3386490 PC2 3982 0.9818050 1.0015100 -0.9811150 2.9447300 0.980326
+4: 0.3386490 PC3 3982 0.4832040 1.0015300 -1.4797500 2.4461600 0.482467
+5: 0.0788481 ADD 3976 -0.0153297 0.0420641 -0.0977739 0.0671145 -0.364436
+6: 0.0788481 PC1 3976 -0.4762890 1.0037500 -2.4436000 1.4910200 -0.474511
+ P ERRCODE
+ <num> <char>
+1: 0.449582 .
+2: 0.705941 .
+3: 0.326985 .
+4: 0.629501 .
+5: 0.715552 .
+6: 0.635162 .
+
+
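The preview above interleaves one ADD row with several PC covariate rows for each variant, so only the ADD rows carry the genotype association. As a rough shell counterpart to filtering on the TEST column, the sketch below works on a small made-up excerpt (toy.glm.linear is hypothetical and carries only a subset of the real columns; the rows reuse values from the preview).

```shell
tmp_dir=$(mktemp -d)
toy="${tmp_dir}/toy.glm.linear"

# Hypothetical tab-separated excerpt with a subset of the real columns.
printf '#CHROM\tPOS\tID\tTEST\tP\n' > "$toy"
printf '1\t1011278\trs3737728\tADD\t0.449582\n' >> "$toy"
printf '1\t1011278\trs3737728\tPC1\t0.705941\n' >> "$toy"
printf '1\t1109721\trs1320565\tADD\t0.715552\n' >> "$toy"
printf '1\t1109721\trs1320565\tPC1\t0.635162\n' >> "$toy"

# Find the TEST column from the header, then keep the header and the ADD rows.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "TEST") col = i; print; next }
  $col == "ADD"
' "$toy" > "${tmp_dir}/toy.add.tsv"

# Subtract one for the header line.
n_add=$(( $(wc -l < "${tmp_dir}/toy.add.tsv") - 1 ))
echo "ADD rows kept: ${n_add}"
```

Looking the column up by name rather than by position keeps the filter working if the set of output columns changes.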
library(data.table)  # fread(), setnames()
+library(qqman)       # manhattan(), qq()
+
+assoc_adjusted <- data.table::fread("./02_activities/data/gwa.qc_A2_assoc.PHENO1.glm.linear.adjusted")
+
+# Keep only the additive genotype effect; drop the PC covariate rows.
+assoc_add <- assoc[TEST == "ADD"]
+
+# qqman expects columns named CHR (chromosome), BP (base pair), SNP, and P (p-value).
+# In PLINK2 output they are #CHROM, POS, ID, and P.
+setnames(assoc_add, c("#CHROM", "POS", "ID"), c("CHR", "BP", "SNP"))
+
+# Manhattan plot
+png("./02_activities/data/manhattan_plot_A2.png", width = 1200, height = 800, res = 150)
+
+manhattan(assoc_add,
+          chr = "CHR",
+          bp = "BP",
+          snp = "SNP",
+          p = "P",
+          xlab = "",
+          ylab = "",
+          suggestiveline = FALSE,
+          cex.axis = 1.5,
+          col = c("lightblue", "lightslateblue"),
+          annotatePval = 5e-5)  # label the top SNP per chromosome with P below 5e-5
+dev.off()
+
+
quartz_off_screen
+ 2
+
+
+
+
Create a QQ plot of the GWAS p-values.
+
+
+
# Q-Q plot
+png("./02_activities/data/QQ_plot_A2.png", width = 1200, height = 800, res = 150)
+qq(assoc_add$P, main = "Q-Q plot of GWAS p-values")
+dev.off()
+
+
quartz_off_screen
+ 2
+
+
+
+
+
Criteria
+
+| Criteria | Complete | Incomplete |
+| --- | --- | --- |
+| Data inspection | Correct sample and SNP counts. | Counts or phenotype description/plot missing or incorrect. |
+| QC & LD pruning | Correct PLINK2 QC command and thresholds. | QC/pruning commands, thresholds, or output datasets missing or incorrect. |
+| Relatedness & PCA | Correct use of PLINK2 command to obtain unrelated samples and PCA run on pruned SNPs. | Relatedness step, unrelated dataset, or PCA analysis missing or incorrect. |
+| GWAS & visualisation | Linear regression GWAS with PCs as covariates; Manhattan and QQ plots produced. | GWAS command, or Manhattan/QQ plots missing or clearly incorrect. |
+
+
+
+
+
Submission Information
+
🚨 Please review our Assignment Submission Guide 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
+
+
+
Submission Parameters
+
+
Submission Due Date: 11:59 PM – 01/04/2026
+
The branch name for your repo should be: assignment-2
+
What to submit for this assignment:
+
+
Populate this Quarto document (assignment_2.qmd).
+
Render the document with Quarto: quarto render assignment_2.qmd.
+
Submit assignment_2.qmd, the rendered HTML file assignment_2.html, and the saved figures in your pull request.
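Before opening the pull request, it can help to confirm that the deliverables exist. The sketch below is a hypothetical helper (check_submission is not part of the course tooling) that assumes the .qmd and rendered .html sit in the same directory; extend the file list with the figure paths you actually saved.

```shell
# check_submission DIR: report any expected deliverable missing from DIR.
# Returns 0 when everything is present, 1 otherwise.
check_submission() {
  dir=$1
  missing=0
  for f in assignment_2.qmd assignment_2.html; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  return "$missing"
}

# Check the current directory and suggest the render step if anything is absent.
check_submission . || echo "Render with: quarto render assignment_2.qmd"
```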
+
+
What the pull request link should look like for this assignment: https://github.com/<your_github_username>/gen_data/pull/<pr_id>
+
+
Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily.