Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
892 changes: 892 additions & 0 deletions 02_activities/assignments/assignment_1.html

Large diffs are not rendered by default.

149 changes: 117 additions & 32 deletions 02_activities/assignments/assignment_1.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -17,96 +17,181 @@ You will need to install PLINK and run the analyses. Please follow the OS-specif

Before fitting any models, it is essential to understand the data. Use R or bash code to answer the following questions about the `gwa.qc.A1.fam`, `gwa.qc.A1.bim`, and `gwa.qc.A1.bed` files, available at the following Google Drive link: <https://drive.google.com/drive/folders/11meVqGCY5yAyI1fh-fAlMEXQt0VmRGuz?usp=drive_link>. Please download all three files from this link and place them in `02_activities/data/`.

```{r setup, include=FALSE}
# Set up a consistent path for all chunks of code
knitr::opts_knit$set(root.dir = normalizePath("../../"))
```
```{r, message=FALSE, warning=FALSE, results='hide'}
library(data.table)
library(ggplot2)
library(seqminer)
library(HardyWeinberg)
library(dplyr)
```
(i) Read the .fam file. How many samples does the dataset contain?

```
# Your answer here...
```{bash}
wc -l ./02_activities/data/gwa.qc.A1.fam
```

We have 4000 samples in the dataset.


(ii) What is the 'variable type' of the response variable (i.e.Continuous or binary)?

```
# Your answer here...
```{bash}
head ./02_activities/data/gwa.qc.A1.fam
```

The variable type of the response variable is continuous, as indicated by the values in the sixth column of the .fam file, which represent the phenotype values for each sample.

(iii) Read the .bim file. How many SNPs does the dataset contain?

```
# Your answer here...
```{bash}
wc -l ./02_activities/data/gwa.qc.A1.bim
```

We have 101083 SNPs in the dataset.

#### Question 2: Allele Frequency Estimation

(i) Load the genotype matrix for SNPs rs1861, rs3813199, rs3128342, and rs11804831 using additive coding. What are the allele frequencies (AFs) for these four SNPs?

```
# Your code here...
```{bash,results='hide',message=FALSE, warning=FALSE}
#plink2 --bfile ./02_activities/data/gwa.qc.A1 --freq --out ./02_activities/data/gwa_qc_A1_freq
printf "%s\n" rs1861 rs3813199 rs3128342 rs11804831 > ./02_activities/data/snplist_A1.txt
plink2 --bfile ./02_activities/data/gwa.qc.A1 \
--extract ./02_activities/data/snplist_A1.txt \
--freq \
--out ./02_activities/data/gwa_qc_A1_freq
```
```{r,message=FALSE, warning=FALSE}
freq <- fread("./02_activities/data/gwa_qc_A1_freq.afreq")
head(freq)
```

I am not fully sure about the question, but from the table we can see that the alternative allele frequencies (AFs) for the four SNPs are as follows:
rs3813199: AF = 0.0569
rs11804831: AF = 0.1543
rs3128342: AF = 0.3051
rs1861: AF = 0.0540
If you are asking for the reference allele frequencies, you can calculate them as 1 - AF.

(ii) What are the minor allele frequencies (MAFs) for these four SNPs?

```
# Your code here...
```
As we can see the alternative allele frequencies (AFs) for the four SNPs are all less than 0.5, so the minor allele frequencies (MAFs) for these four SNPs are the same as their AFs:
rs3813199: AF = 0.0569
rs11804831: AF = 0.1543
rs3128342: AF = 0.3051
rs1861: AF = 0.0540


#### Question 3: Hardy–Weinberg Equilibrium (HWE) Test

(i) Conduct the Hardy–Weinberg Equilibrium (HWE) test for all SNPs in the .bim file. Then, load the file containing the HWE p-value results and display the first few rows of the resulting data frame.

```
# Your code here...
```{bash,results='hide',message=FALSE, warning=FALSE}
plink2 --bfile ./02_activities/data/gwa.qc.A1 --hardy --out ./02_activities/data/gwa_qc_A1_hwe
```

```{r,message=FALSE, warning=FALSE}
# View HWE results
hwe <- fread("./02_activities/data/gwa_qc_A1_hwe.hardy")
head(hwe)
```

(ii) What are the HWE p-values for SNPs rs1861, rs3813199, rs3128342, and rs11804831?

```
# Your code here...
```{r,message=FALSE, warning=FALSE}
# View HWE results
hwe_subset <- hwe %>% filter(ID %in% c("rs1861", "rs3813199", "rs3128342", "rs11804831"))
hwe_subset
```

The HWE p-values for the four SNPs are shown in the last column of the table above.

#### Question 4: Genetic Association Test

(i) Conduct a linear regression to test the association between SNP rs1861 and the phenotype. What is the p-value?

```
# Your code here...
```{bash,results='hide',message=FALSE, warning=FALSE}
plink2 --bfile ./02_activities/data/gwa.qc.A1 --export A --out ./02_activities/data/gwa_qc_A1_additive
```

```{r}
geno <- fread("./02_activities/data/gwa_qc_A1_additive.raw")
geno[1:5,1:10]
model_additive <- lm(PHENOTYPE ~ rs1861_C, data = geno)
summary(model_additive)
```
The p-value for the association between SNP rs1861 and the phenotype is 2e-16.

(ii) How would you interpret the beta coefficient from this regression?

```
# Your answer here...
```
Each additional copy of the C-G allele at rs1861 is associated with an increase of 0.974 units in the outcome variable of Phenotype.

(iii) Plot the scatterplot of phenotype versus the genotype of SNP rs1861. Add the regression line to the plot.
```{r}
geno_range <- data.frame(
rs1861_C = seq(0, 2, length.out = 100)
)
geno_range$predicted_prob <- predict(model_additive, newdata = geno_range, type = "response")

p <- ggplot(geno, aes(x = rs1861_C, y = PHENOTYPE)) +
geom_point(color = "blue", alpha = 0.6) +
geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
labs(
title = "Linear Regression: PHENOTYPE ~ Genotype",
x = "Genotype (0/1/2)",
y = "Phenotype"
) +
theme_minimal()
print(p)

```
# Your code here...
```

(iv) Convert the genotype coding for rs1861 to recessive coding.

```
# Your code here...
```{r}
geno_dom <- geno # make a copy
geno_dom[, 7:ncol(geno_dom)] <- as.data.frame(
lapply(geno[, 7:ncol(geno)], function(x) ifelse(x == 2, 1, 0))
)
```

(v) Conduct a linear regression to test the association between the recessive-coded rs1861 and the phenotype. What is the p-value?

```
# Your code here...
```{r}
model_recessive <- lm(PHENOTYPE ~ rs1861_C, data = geno_dom)
summary(model_recessive)
```
The p-value for the association between the recessive-coded rs1861 and the phenotype is 2e-16.

(vi) Plot the scatterplot of phenotype versus the recessive-coded genotype of rs1861. Add the regression line to the plot.

```
# Your code here...
```{r}
geno_range <- data.frame(
rs1861_C = seq(0, 1, length.out = 100)
)
geno_range$predicted_prob <- predict(model_recessive, newdata = geno_range, type = "response")

p <- ggplot(geno_dom, aes(x = rs1861_C, y = PHENOTYPE)) +
geom_point(color = "blue", alpha = 0.6) +
geom_line(data = geno_range, aes(x = rs1861_C, y = predicted_prob), color = "red", size = 1) +
labs(
title = "Linear Regression: PHENOTYPE ~ Genotype",
x = "Genotype (0/1)",
y = "Phenotype"
) +
theme_minimal()
print(p)

```

(vii) Which model fits better? Justify your answer.

```
# Your answer here...
```

The additive model fits better than the recessive model. This is because the additive model has a higher adjusted R-squared value (0.08901) compared to the recessive model (0.08559
), indicating that the additive model explains more variance in the phenotype.
### Criteria

| Criteria | Complete | Incomplete |
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Large diffs are not rendered by default.

Loading