library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)Assignment #1
+Assignment 1
+You only need to write lines of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.
+We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the help help() function in R, to get information about the datasets and function in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).
+Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.
+You will need to install PLINK and run the analyses. Please follow the OS-specific setup guide in SETUP.md. PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
Question 1: Data inspection
+Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the gwa.qc.A1.fam, gwa.qc.A1.bim, and gwa.qc.A1.bed files, available at the following Google Drive link: https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link. Please download all three files from this link and place them in 02_activities/data/.
-
+
- Read the .fam file. How many samples does the dataset contain? +
wc -l ./02_activities/data/gwa.qc.A1.fam 4000 ./02_activities/data/gwa.qc.A1.fam
+We have 4000 samples in the dataset.
+-
+
- What is the ‘variable type’ of the response variable (i.e.Continuous or binary)? +
head ./02_activities/data/gwa.qc.A1.fam0 A2001 0 0 1 -0.694438129641973
+1 A2002 0 0 1 1.85384536141856
+2 A2003 0 0 1 2.08263677761584
+3 A2004 0 0 1 2.73871473943968
+4 A2005 0 0 1 1.34114035564636
+5 A2006 0 0 1 0.416778586749647
+6 A2007 0 0 1 2.38297123290054
+7 A2008 0 0 1 1.51429928826958
+8 A2009 0 0 1 0.718686390529039
+9 A2010 0 0 1 2.08904136245205
+The variable type of the response variable is continuous, as indicated by the values in the sixth column of the .fam file, which represent the phenotype values for each sample.
+-
+
- Read the .bim file. How many SNPs does the dataset contain? +
wc -l ./02_activities/data/gwa.qc.A1.bim 101083 ./02_activities/data/gwa.qc.A1.bim
+We have 101083 SNPs in the dataset.
+Question 2: Allele Frequency Estimation
+-
+
- Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs? +
#plink2 --bfile ./02_activities/data/gwa.qc.A1 --freq --out ./02_activities/data/gwa_qc_A1_freq
+printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/snplist_A1.txt
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+ --extract ./02_activities/data/snplist_A1.txt \
+ --freq \
+ --out ./02_activities/data/gwa_qc_A1_freqfreq <- fread("./02_activities/data/gwa_qc_A1_freq.afreq")
+head(freq) #CHROM ID REF ALT PROVISIONAL_REF? ALT_FREQS OBS_CT
+ <int> <char> <char> <char> <char> <num> <int>
+1: 1 rs3813199 G A Y 0.0569126 7942
+2: 1 rs11804831 T C Y 0.1543410 7924
+3: 1 rs3128342 C A Y 0.3051210 7928
+4: 1 rs1861 C A Y 0.0539859 7928
+I am not fully sure about the question, but from the table we can see that the alternative allele frequencies (AFs) for the four SNPs are as follows: rs3813199: AF = 0.0569 rs11804831: AF = 0.1543 rs3128342: AF = 0.3051 rs1861: AF = 0.0540 If you are asking for the reference allele frequencies, you can calculate them as 1 - AF.
+-
+
- What are the minor allele frequencies (MAFs) for these four SNPs? +
As we can see the alternative allele frequencies (AFs) for the four SNPs are all less than 0.5, so the minor allele frequencies (MAFs) for these four SNPs are the same as their AFs: rs3813199: AF = 0.0569 rs11804831: AF = 0.1543 rs3128342: AF = 0.3051 rs1861: AF = 0.0540
+Question 3: Hardy–Weinberg Equilibrium (HWE) Test
+-
+
- Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame. +
plink2 --bfile ./02_activities/data/gwa.qc.A1 --hardy --out ./02_activities/data/gwa_qc_A1_hwe# View HWE results
+hwe <- fread("./02_activities/data/gwa_qc_A1_hwe.hardy")
+head(hwe) #CHROM ID A1 AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+ <int> <char> <char> <char> <int> <int> <int> <num>
+1: 1 rs3737728 G A 1713 1841 428 0.462330
+2: 1 rs1320565 C T 3368 589 19 0.148139
+3: 1 rs3813199 G A 3531 428 12 0.107781
+4: 1 rs11804831 T C 2820 1061 81 0.267794
+5: 1 rs3766178 T C 2391 1378 214 0.345970
+6: 1 rs3128342 C A 1927 1655 382 0.417508
+ E(HET_A1) P
+ <num> <num>
+1: 0.447932 0.0437892
+2: 0.145262 0.2734290
+3: 0.107347 1.0000000
+4: 0.261040 0.1133540
+5: 0.350629 0.4158770
+6: 0.424044 0.3302730
+-
+
- What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831? +
# View HWE results
+hwe_subset <- hwe %>% filter(ID %in% c("rs1861", "rs3813199", "rs3128342", "rs11804831"))
+hwe_subset #CHROM ID A1 AX HOM_A1_CT HET_A1_CT TWO_AX_CT O(HET_A1)
+ <int> <char> <char> <char> <int> <int> <int> <num>
+1: 1 rs3813199 G A 3531 428 12 0.107781
+2: 1 rs11804831 T C 2820 1061 81 0.267794
+3: 1 rs3128342 C A 1927 1655 382 0.417508
+4: 1 rs1861 C A 3551 398 15 0.100404
+ E(HET_A1) P
+ <num> <num>
+1: 0.107347 1.000000
+2: 0.261040 0.113354
+3: 0.424044 0.330273
+4: 0.102143 0.274719
+The HWE p-values for the four SNPs are shown in the last column of the table above.
+Question 4: Genetic Association Test
+-
+
- Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value? +
plink2 --bfile ./02_activities/data/gwa.qc.A1 --export A --out ./02_activities/data/gwa_qc_A1_additivegeno <- fread("./02_activities/data/gwa_qc_A1_additive.raw")
+geno[1:5,1:10] FID IID PAT MAT SEX PHENOTYPE rs3737728_G rs1320565_C rs3813199_G
+ <int> <char> <int> <int> <int> <num> <int> <int> <int>
+1: 0 A2001 0 0 1 -0.694438 1 2 2
+2: 1 A2002 0 0 1 1.853850 1 2 2
+3: 2 A2003 0 0 1 2.082640 1 2 2
+4: 3 A2004 0 0 1 2.738710 2 2 2
+5: 4 A2005 0 0 1 1.341140 1 2 2
+ rs11804831_T
+ <int>
+1: 2
+2: 2
+3: 1
+4: 2
+5: 1
+model_additive <- lm(PHENOTYPE ~ rs1861_C, data = geno)
+summary(model_additive)
+Call:
+lm(formula = PHENOTYPE ~ rs1861_C, data = geno)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-3.5439 -0.6850 0.0021 0.6993 3.3268
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 0.05238 0.09486 0.552 0.581
+rs1861_C 0.97382 0.04943 19.703 <2e-16 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.003 on 3962 degrees of freedom
+ (36 observations deleted due to missingness)
+Multiple R-squared: 0.08924, Adjusted R-squared: 0.08901
+F-statistic: 388.2 on 1 and 3962 DF, p-value: < 2.2e-16
+The p-value for the association between SNP rs1861 and the phenotype is 2e-16.
+-
+
- How would you interpret the beta coefficient from this regression? +
Each additional copy of the C-G allele at rs1861 is associated with an increase of 0.974 units in the outcome variable of Phenotype.
+-
+
- Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot. +
geno_range <- data.frame(
+ rs1861_C = seq(0, 2, length.out = 100)
+ )
+geno_range$predicted_prob <- predict(model_additive, newdata = geno_range, type = "response")
+
+p <- ggplot(geno, aes(x = rs1861_C, y = PHENOTYPE)) +
+ geom_point(color = "blue", alpha = 0.6) +
+ geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
+ labs(
+ title = "Linear Regression: PHENOTYPE ~ Genotype",
+ x = "Genotype (0/1/2)",
+ y = "Phenotype"
+ ) +
+ theme_minimal()Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+print(p)Warning: Removed 36 rows containing missing values or values outside the scale range
+(`geom_point()`).
+
-
+
- Convert the genotype coding for rs1861 to recessive coding. +
geno_dom <- geno # make a copy
+geno_dom[, 7:ncol(geno_dom)] <- as.data.frame(
+ lapply(geno[, 7:ncol(geno)], function(x) ifelse(x == 2, 1, 0))
+)-
+
- Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value? +
model_recessive <- lm(PHENOTYPE ~ rs1861_C, data = geno_dom)
+summary(model_recessive)
+Call:
+lm(formula = PHENOTYPE ~ rs1861_C, data = geno_dom)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-3.5437 -0.6892 0.0015 0.7016 3.3270
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 0.99231 0.04945 20.07 <2e-16 ***
+rs1861_C 1.00754 0.05224 19.29 <2e-16 ***
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 1.005 on 3962 degrees of freedom
+ (36 observations deleted due to missingness)
+Multiple R-squared: 0.08582, Adjusted R-squared: 0.08559
+F-statistic: 372 on 1 and 3962 DF, p-value: < 2.2e-16
+The p-value for the association between the recessive-coded rs1861 and the phenotype is 2e-16.
+-
+
- Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot. +
geno_range <- data.frame(
+ rs1861_C = seq(0, 1, length.out = 100)
+ )
+geno_range$predicted_prob <- predict(model_recessive, newdata = geno_range, type = "response")
+
+p <- ggplot(geno_dom, aes(x = rs1861_C, y = PHENOTYPE)) +
+ geom_point(color = "blue", alpha = 0.6) +
+ geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
+ labs(
+ title = "Linear Regression: PHENOTYPE ~ Genotype",
+ x = "Genotype (0/1)",
+ y = "Phenotype"
+ ) +
+ theme_minimal()
+print(p)Warning: Removed 36 rows containing missing values or values outside the scale range
+(`geom_point()`).
+
-
+
- Which model fits better? Justify your answer. +
The additive model fits better than the recessive model. This is because the additive model has a higher adjusted R-squared value (0.08901) compared to the recessive model (0.08559 ), indicating that the additive model explains more variance in the phenotype. ### Criteria
+| Criteria | +Complete | +Incomplete | +
|---|---|---|
| Data Inspection | +Correct sample/SNP counts and variable type identified. | +Missing or incorrect counts or variable type. | +
| Allele Frequency Estimation | +Correct allele and minor allele frequencies computed. | +Frequencies missing or wrong. | +
| Hardy–Weinberg Equilibrium Test | +Correct PLINK command and p-value extraction in R. | +PLINK command or extraction incorrect/missing. | +
| Genetic Association Test | +Correct regressions, plots, coding, and interpretation. | +Regression, plots, or interpretation missing/incomplete. | +
Submission Information
+📌 Please review our Assignment Submission Guide for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.
+Note:
+If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.
++
Submission Parameters
+-
+
Submission Due Date:
11:59 PM – 16/03/2026
+Branch name for your repo should be:
assignment-1
+What to submit for this assignment:
+-
+
- Populate this Quarto document (
assignment_1.qmd).
+ - Render the document with Quarto:
quarto render assignment_1.qmd.
+ - Submit both
assignment_1.qmdand the rendered HTML fileassignment_1.htmlin your pull request.
+
+- Populate this Quarto document (
What the pull request link should look like for this assignment:
+https://github.com/<your_github_username>/gen_data/pull/<pr_id>-
+
- Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support team review your submission easily. +
+
+
Checklist:
+-
+
- Created a branch with the correct naming convention. +
- Ensured that the repository is public. +
- Reviewed the PR description guidelines and adhered to them. +
- Verified that the link is accessible in a private browser window. +
- Confirmed that both
assignment_1.qmdandassignment_1.htmlare included in the pull request.
+
If you encounter any difficulties or have questions, please don’t hesitate to reach out to our team via our Slack help channel. Our technical facilitators and learning support team are here to help you navigate any challenges.
+