siying-ma · siying-ma · Mar 17, 2026
diff --git a/02_activities/assignments/assignment_1.html b/02_activities/assignments/assignment_1.html
diff --git a/02_activities/assignments/assignment_1.qmd b/02_activities/assignments/assignment_1.qmd
@@ -17,96 +17,181 @@ You will need to install PLINK and run the analyses. Please follow the OS-specif
 
 Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: <https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link>. Please download all three files from this link and place them in `02_activities/data/`.
 
+```{r setup, include=FALSE}
+# Set up a consistent path for all chunks of code
+knitr::opts_knit$set(root.dir = normalizePath("../../"))
+```
+```{r, message=FALSE, warning=FALSE, results='hide'}
+library(data.table)
+library(ggplot2)
+library(seqminer)
+library(HardyWeinberg)
+library(dplyr)
+```
 (i) Read the .fam file. How many samples does the dataset contain?
 
-```         
-# Your answer here...
+```{bash}         
+wc -l ./02_activities/data/gwa.qc.A1.fam
 ```
 
+We have 4000 samples in the dataset.
+
+
 (ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)?
 
-```         
-# Your answer here...
+```{bash}         
+head ./02_activities/data/gwa.qc.A1.fam
 ```
 
+The variable type of the response variable is continuous, as indicated by the values in the sixth column of the .fam file, which represent the phenotype values for each sample.
+
 (iii) Read the .bim file. How many SNPs does the dataset contain?
 
-```         
-# Your answer here...
+```{bash}         
+wc -l ./02_activities/data/gwa.qc.A1.bim
 ```
 
+We have 101083 SNPs in the dataset.
+
 #### Question 2: Allele Frequency Estimation
 
 (i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?
 
-```         
-# Your code here...
+```{bash,results='hide',message=FALSE, warning=FALSE}
+#plink2 --bfile ./02_activities/data/gwa.qc.A1 --freq --out ./02_activities/data/gwa_qc_A1_freq
+printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/snplist_A1.txt
+plink2 --bfile ./02_activities/data/gwa.qc.A1 \
+      --extract ./02_activities/data/snplist_A1.txt \
+      --freq \
+      --out ./02_activities/data/gwa_qc_A1_freq
+```
+```{r,message=FALSE, warning=FALSE}
+freq <- fread("./02_activities/data/gwa_qc_A1_freq.afreq")
+head(freq)
 ```
 
+I am not fully sure about the question, but from the table we can see that the alternative allele frequencies (AFs) for the four SNPs are as follows:
+rs3813199: AF = 0.0569
+rs11804831: AF = 0.1543
+rs3128342: AF = 0.3051
+rs1861: AF = 0.0540
+If you are asking for the reference allele frequencies, you can calculate them as 1 - AF.
+
 (ii) What are the minor allele frequencies (MAFs) for these four SNPs?
 
-```         
-# Your code here...
-```
+As we can see the alternative allele frequencies (AFs) for the four SNPs are all less than 0.5, so the minor allele frequencies (MAFs) for these four SNPs are the same as their AFs:
+rs3813199: AF = 0.0569
+rs11804831: AF = 0.1543
+rs3128342: AF = 0.3051
+rs1861: AF = 0.0540
+
 
 #### Question 3: Hardy–Weinberg Equilibrium (HWE) Test
 
 (i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.
 
-```         
-# Your code here...
+```{bash,results='hide',message=FALSE, warning=FALSE}
+plink2 --bfile ./02_activities/data/gwa.qc.A1 --hardy --out ./02_activities/data/gwa_qc_A1_hwe
+```
+
+```{r,message=FALSE, warning=FALSE}
+# View HWE results
+hwe <- fread("./02_activities/data/gwa_qc_A1_hwe.hardy")
+head(hwe)
 ```
 
 (ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?
 
-```         
-# Your code here...
+```{r,message=FALSE, warning=FALSE}
+# View HWE results
+hwe_subset <- hwe %>% filter(ID %in% c("rs1861", "rs3813199", "rs3128342", "rs11804831"))
+hwe_subset
 ```
 
+The HWE p-values for the four SNPs are shown in the last column of the table above.
+
 #### Question 4: Genetic Association Test
 
 (i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?
 
-```         
-# Your code here...
+```{bash,results='hide',message=FALSE, warning=FALSE}
+plink2 --bfile ./02_activities/data/gwa.qc.A1 --export A --out ./02_activities/data/gwa_qc_A1_additive
+```
+
+```{r}
+geno <- fread("./02_activities/data/gwa_qc_A1_additive.raw")
+geno[1:5,1:10]
+model_additive <- lm(PHENOTYPE ~ rs1861_C, data = geno)
+summary(model_additive)
 ```
+The p-value for the association between SNP rs1861 and the phenotype is 2e-16.
 
 (ii) How would you interpret the beta coefficient from this regression?
 
-```         
-# Your answer here...
-```
+Each additional copy of the C-G allele at rs1861 is associated with an increase of 0.974 units in the outcome variable of Phenotype.
 
 (iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
+```{r}
+geno_range <- data.frame(
+   rs1861_C = seq(0, 2, length.out = 100)
+ )
+geno_range$predicted_prob <- predict(model_additive, newdata = geno_range, type = "response")
+
+p <- ggplot(geno, aes(x = rs1861_C, y = PHENOTYPE)) +
+  geom_point(color = "blue", alpha = 0.6) +
+    geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
+  labs(
+    title = "Linear Regression: PHENOTYPE ~ Genotype",
+    x = "Genotype (0/1/2)",
+    y = "Phenotype"
+  ) +
+  theme_minimal()
+print(p)
 
-```         
-# Your code here...
 ```
 
 (iv) Convert the genotype coding for rs1861 to recessive coding.
 
-```         
-# Your code here...
+```{r}      
+geno_dom <- geno  # make a copy
+geno_dom[, 7:ncol(geno_dom)] <- as.data.frame(
+  lapply(geno[, 7:ncol(geno)], function(x) ifelse(x == 2, 1, 0))
+)
 ```
 
 (v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?
 
-```         
-# Your code here...
+```{r}        
+model_recessive <- lm(PHENOTYPE ~ rs1861_C, data = geno_dom)
+summary(model_recessive)
 ```
+The p-value for the association between the recessive-coded rs1861 and the phenotype is 2e-16.
 
 (vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.
 
-```         
-# Your code here...
+```{r}
+geno_range <- data.frame(
+   rs1861_C = seq(0, 1, length.out = 100)
+ )
+geno_range$predicted_prob <- predict(model_recessive, newdata = geno_range, type = "response")
+
+p <- ggplot(geno_dom, aes(x = rs1861_C, y = PHENOTYPE)) +
+  geom_point(color = "blue", alpha = 0.6) +
+    geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
+  labs(
+    title = "Linear Regression: PHENOTYPE ~ Genotype",
+    x = "Genotype (0/1)",
+    y = "Phenotype"
+  ) +
+  theme_minimal()
+print(p)
+
 ```
 
 (vii) Which model fits better? Justify your answer.
 
-```         
-# Your answer here...
-```
-
+The additive model fits better than the recessive model. This is because the additive model has a higher adjusted R-squared value (0.08901) compared to the recessive model (0.08559
+), indicating that the additive model explains more variance in the phenotype. 
 ### Criteria
 
 | Criteria | Complete | Incomplete |

diff --git a/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-12-1.png b/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-12-1.png
diff --git a/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-15-1.png b/02_activities/assignments/assignment_1_files/figure-html/unnamed-chunk-15-1.png
diff --git a/...ents/assignment_1_files/libs/bootstrap/bootstrap-138a6193a3bd40baf1e627da441a4734.min.css b/...ents/assignment_1_files/libs/bootstrap/bootstrap-138a6193a3bd40baf1e627da441a4734.min.css