diff --git a/src/exercises/MCMC_1-intro.jl b/src/exercises/MCMC_1-intro.jl new file mode 100644 index 0000000..1633edd --- /dev/null +++ b/src/exercises/MCMC_1-intro.jl @@ -0,0 +1,484 @@ +### A Pluto.jl notebook ### +# v0.20.4 + +#> [frontmatter] +#> order = "20" +#> title = "4. MCMC intro" +#> date = "2025-03-24" +#> tags = ["exercises"] +#> description = "Introduction to the inference practicals" +#> layout = "layout.jlhtml" +#> +#> [[frontmatter.author]] +#> name = "Bram Spanoghe" + +using Markdown +using InteractiveUtils + +# ╔═╡ 06bdc430-b965-11ef-36a4-3d863afbaf6e +using Pkg; Pkg.activate("../../pluto-deployment-environment") + +# ╔═╡ ce9c8c34-3690-4241-b021-c08868157a55 +using Turing, StatsPlots + +# ╔═╡ ea5d107c-f171-401f-8972-6a928ab72907 +using PlutoUI; TableOfContents() + +# ╔═╡ eabed73e-19dc-4265-965a-cf762d630fb3 +md"# Inference notebook #1: Intro" + +# ╔═╡ daed8bc0-8a85-45ce-84ac-0d13ff1923f1 +md"## Problem" + +# ╔═╡ 6a79727c-5b5e-43cc-862f-7182c1ea878c +md""" +According to the [molecular clock hypothesis](https://en.wikipedia.org/wiki/Molecular_clock), the average amount of mutations in a gene $\bar{N}$ is proportional to how much time $t$ has passed, and identical for all species: + +$\bar{N} = \alpha \, t \, .$ + +While this is a bit of an oversimplification, the concept has become an important tool in evolutionary biology to estimate how long ago species have diverged. +""" + +# ╔═╡ 83aa4c69-c027-4ade-ac44-438d784b2b78 +md""" +Consider the below figure of a small slice of the [tree of life](https://en.wikipedia.org/wiki/Tree_of_life_(biology)). Every animal represents a (fossilized) individual living during some point in evolution. 
+""" + +# ╔═╡ 3ecd3ac7-2621-4ea6-8ff1-69962769d934 +md""" +![Evolution example](https://raw.githubusercontent.com/Kermit-UGent/ModSim/2a369561ce842cf079d7660a36d0d9308739dc69/examples/ProbMod/figures/treeoflife.excalidraw.svg) +""" + +# ╔═╡ 71bbd593-f2e4-40dd-b682-40e65305ebb3 +md""" +We start at time 0 with a common ancestor of fish and terrestrial animals. 30 million years (Ma) later it diverges into ray-finned fish, which will give rise to most modern fish species, and lob-finned fish, which will give rise to i.e. mammals and reptiles. + +The ray-finned fish fossil is also one of the individuals for which we have DNA for its *cytochrome C* gene. The number represents that it has 25 mutations in this gene compared to the gene's sequence from our starting organism, the ancient bony fish fossil. +""" + +# ╔═╡ 138538d4-f6b5-4b8e-af3a-273858cc463c +md""" +Taking into account all fossils, we can see that the number of mutations is roughly proportional with the time that has passed. +""" + +# ╔═╡ f49c6d3f-5870-4473-b142-799fa84dbfb7 +times = [30, 138, 375, 450] + +# ╔═╡ 4ea82c1e-490f-4d88-b60c-5c2118409408 +observed_mutations = [25, 94, 302, 335] + +# ╔═╡ 26128162-e355-4316-8d74-291fbca194a6 +scatter(times, observed_mutations, xlabel = "Time (My)", ylabel = "Number of mutations", legend = false, xlims = (0, 500), ylims = (0, 400)) + +# ╔═╡ d3fdfa6f-1ee2-48d0-bb6c-483a6fb4faf9 +md"This leads us to our first question:" + +# ╔═╡ d443d3e8-c1ee-467f-88b2-a0211d5b5cc0 +md""" +!!! question + - What are realistic values for *cytochrome C*'s mutation rate `α`? +""" + +# ╔═╡ 5d83f336-372b-4493-baf6-8efffb663ff1 +md""" +Figuring out the answer to the first question allows for some exciting follow-up questions! Consider for example that you find a new fossil of an ancient ancestor of the **seahorses**. 
+""" + +# ╔═╡ 6666a388-1c45-4f6f-804f-eb3a260eae98 +md"![Sharkmoment](https://raw.githubusercontent.com/Kermit-UGent/ModSim/2a369561ce842cf079d7660a36d0d9308739dc69/examples/ProbMod/figures/treeoflife2.excalidraw.svg)" + +# ╔═╡ b0dbd481-ce36-4364-b995-b4fa8f36f76d +md""" +You don't know how old the fossil is, but you do find that the fossilized DNA contains **156** mutations in the _cytochrome c_ gene. How old should it be estimated as? +""" + +# ╔═╡ 641a1a77-6fcb-4a6c-8f8f-3b171704efa8 +md"Our second question is then:" + +# ╔═╡ 442aa57f-79d8-4b22-93b2-466ac92c2b13 +md""" +!!! question + - What values are likely for the seahorse-ancestor fossil's age? +""" + +# ╔═╡ 12f017e3-b7b8-408d-a677-52dd2ea900eb +md"## Explanation" + +# ╔═╡ 7900fd13-cc69-41c5-972b-16d5b6d5452a +md"""### Making the model""" + +# ╔═╡ 430927e8-7fb8-494c-b9da-d46f000c142f +md""" +We start again by defining a Turing model. Similar to the models of previous practical, it describes the *forward process*: how do you generate your observations (the amount of mutations $N$) based on your inputs (time that has passed $t$) and parameters (the mutation rate $α$)? + +As discussed in the introduction, we assume a linear relationship between the **average** amount of mutations $\bar{N}$ and $t$: + +$\bar{N} = \alpha \, t$ +""" + +# ╔═╡ 656d3299-e5d0-4799-84ae-da32f51bd92e +md""" +Since the accumulation of mutations is a random process, we can't expect the number of mutations $N$ to be exactly the predicted average $\bar{N}$. We can however expect it to be close to the predicted average. We can express this by saying the amount of mutations follows a distribution centered around the expected average: + +$N \sim \text{Poisson}(\bar{N}) \, .$ +""" + +# ╔═╡ 6ec75944-0b62-43a3-80e9-d6cd2ebd4478 +md""" +!!! note + Can you explain why a Poisson distribution is a natural fit here? 
+""" + +# ╔═╡ 0b5205c2-0eb4-4734-8470-055d29954f63 +md""" +One problem: our model uses the gene's mutation rate `α`, but we don't know what it is. However, we do have some **prior** knowledge about mutation rates of essential genes (in general): they don't tend to be much larger than a few bp/My. We can encode this information by giving `α` the prior distribution `Exponential(2)`. +""" + +# ╔═╡ 2034e550-84df-42ee-9f0d-2bf417563dd4 +prior_alpha = Exponential(2) + +# ╔═╡ 3221f614-ef9a-4e09-a7da-b649fff0ab61 +plot(prior_alpha, title = "Prior belief of α", legend = false, xlabel = "α", ylabel = "Probability density") + +# ╔═╡ 7a26cfec-b362-41c4-9b2e-1d040db8a5eb +md""" +We can sample some values from this prior distribution and use them to plot the *a priori* expected trends between time $t$ and average number of mutations $\bar{N}$: +""" + +# ╔═╡ 1080294d-8da9-4326-a53b-afe158dc2ab9 +begin + scatter(times, observed_mutations, xlabel = "Time (My)", + ylabel = "Number of mutations", label = false, xlims = (0, 500), + ylims = (0, 400), title = "A priori relationship between t and mean N" + ); + for α in rand(prior_alpha, 1000) + plot!(x -> α*x, color = :purple, alpha = 0.05, label = false); + end + plot!() +end + +# ╔═╡ 624a0a67-c5c1-437c-89e1-e08e83003b41 +md""" +We can see that some of the mutation rates from our prior distribution result in a relationship that matches the data well. However, our prior belief is *much too broad*: only a few of the lines are realistic for the data. Finding which of these lines from our prior belief are realistic for the data is essentially what inference does. +""" + +# ╔═╡ 97257007-fbfa-4065-8788-869801ce3730 +md""" +!!! note + Why use `Exponential(2)` for the prior? + + Choosing a prior distribution is largely subjective and a big reason why some people are not fond of Bayesian modeling. There is no "one correct prior distribution". + + However, different choices of reasonable priors often give very similar outcomes. 
Try running this notebook at the end with a different prior for `α`, such as `Exponential(10)` or `Uniform(0, 100)`. When are the results significantly different?
+"""
+
+# ╔═╡ 2a652734-bf47-405b-abb3-00e3d453bf47
+md"""
+**In summary**, we now have:
+- a model that predicts our output $N$ from the input $t$:
+$\bar{N} = α \, t \, ,$
+$N \sim \text{Poisson}(\bar{N}) \, .$
+- a prior distribution for the parameter $α$.
+"""
+
+# ╔═╡ c52854b2-b2b6-4437-b65b-a1f912d054d5
+md"We then translate it into a Turing model:"
+
+# ╔═╡ 950c03a9-0056-4760-81f8-dd10d1e55cea
+@model function mutations(ts)
+    α ~ prior_alpha # prior distribution of parameter
+
+    N = zeros(length(ts)) # output variable: in this model we have multiple values, so we need to preallocate a vector
+    for i in eachindex(ts)
+        N_average = α * ts[i]
+        N[i] ~ Poisson(N_average)
+    end
+
+    return N
+end
+
+# ╔═╡ 31378eb3-51a5-4ad6-a713-7f77c7ceafcc
+md"""
+As you may notice, we have defined the times $t$ as an input to the Turing model. This is not strictly necessary: you could also hardcode the given values of $t$ (the variable `times`) in the Turing model, but passing them as an argument lets you easily reuse the model for different values of the input variables. 
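+
+For comparison, a hardcoded variant (just a sketch, not used in this notebook) would look like:
+
+```julia
+@model function mutations_hardcoded()
+    α ~ prior_alpha
+    N = zeros(length(times)) # uses the global `times` directly
+    for i in eachindex(times)
+        N[i] ~ Poisson(α * times[i])
+    end
+    return N
+end
+```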
+
+You simply instantiate the model with the correct values of $t$ as follows:
+"""
+
+# ╔═╡ 48f6b7dc-13aa-4057-8468-97db047773ba
+mutation_model = mutations(times);
+
+# ╔═╡ d4dbc185-6087-4862-b6e5-b5e4f51c2866
+md"We can then generate random samples of the output as usual:"
+
+# ╔═╡ 8b0bf05f-92a0-4ce7-8042-08c790088688
+mutation_model() # random sample of N
+
+# ╔═╡ 3e4998e7-4981-4945-8bf2-ddb5afcb43b1
+chain = sample(mutation_model, Prior(), 2000);
+
+# ╔═╡ 3421987d-1aab-4cab-bf83-dd3653715bce
+α_sp = chain[:α]; # random sample of α
+
+# ╔═╡ a9d54dfc-5337-412f-86b0-deeb5a0b6928
+histogram(α_sp, title = "Sample of prior of α")
+
+# ╔═╡ 425b4b6c-76b8-4676-8d0a-fd26711400d6
+md"### Inference"
+
+# ╔═╡ b8adbdd4-2642-4375-9979-0cb8f52c5bc8
+md"""
+The model so far has no extra information outside of our prior knowledge.
+We can change this by **conditioning** the model on observed data as follows:
+"""
+
+# ╔═╡ 70f9e94d-a4e6-47d6-8d19-b60f7011d572
+conditioned_model = mutation_model | (N = observed_mutations,)
+
+# ╔═╡ 00c24cbb-88ba-49f2-9bf5-f44538fb2413
+md"Note the syntax: we tell Turing that the value of the random variable `N` defined in our Turing model should be the values given by the variable `observed_mutations`."
+
+# ╔═╡ 2d0c969d-03a2-4e4c-ace4-e439f81c771b
+md"""
+!!! danger
+    Note the `,` at the end of `(N = observed_mutations,)`. This is important: without it, Julia thinks you simply put parentheses around a variable assignment and you'll get an error! See the below cell for an example. 
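+
+    The trailing comma is what turns the parentheses into a one-element `NamedTuple`, which is what conditioning expects:
+
+    ```julia
+    (N = [25, 94],)  # a NamedTuple with a single field `N`
+    (N = [25, 94])   # just an assignment to `N`, wrapped in parentheses
+    ```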
+""" + +# ╔═╡ a35a43e2-e6b0-47ce-80b2-48148336274c +forgot_comma = mutation_model | (N = observed_mutations) + # errors because there is no `,` in the parentheses + +# ╔═╡ c5f0dbb3-fba1-41f2-b7d2-740012603555 +md""" +We can verify that for our conditioned model, the value of `N` has been set as constant: +""" + +# ╔═╡ d03cef36-3e82-4de4-89e7-af9f772edd8d +conditioned_model() # always returns `observed_mutations` + +# ╔═╡ 371a48d5-daea-4d0b-968b-7e3056a65494 +md""" +What we're after is our updated belief on the distribution of `α` given the observed data. We can do this by using the `sample` function on our model. We no longer use `Prior()` as second input, and instead choose one of the following sampling algorithms: +- `MH`: Metropolis-Hastings sampler +- `Gibbs`: Gibbs sampler +- `PG`: Particle Gibbs sampler +- `HMC`: Hamiltonian Monte Carlo sampler +- `NUTS`: No-U-Turn sampler + +You can find more information about them in the corresponding Julia docs (see the `🔍Live Docs` in the bottom right corner). In practice, `NUTS` is often an excellent choice if all prior distributions are continuous and `PG` with 10-20 particles is a good default choice in all other cases. (`MH` and `Gibbs` also have their uses, but usually it takes more effort to make them work well.) +""" + +# ╔═╡ 0d2c1359-434f-4f3d-8c04-c452c46d7ae8 +mutation_chain = sample(conditioned_model, NUTS(), 2000) + +# ╔═╡ 1c00437c-e2f3-44f6-b020-ca213e321239 +md"It's always a good idea to check whether your sampling process has converged. You can do this by plotting the chain. It should look like a fuzzy caterpillar." + +# ╔═╡ 4c79adff-0640-4e69-815d-ab94ebd9c937 +plot(mutation_chain) # looks appropriately fuzzy! + +# ╔═╡ f620d591-7982-4b67-9524-45cfac27436b +md""" +!!! note + For an example of a non-converged chain, try using the `MH()` sampler instead of `NUTS()`. This sampling algorithm takes a lot of fiddling with its parameters (or a larger number of samples) for it to work well. 
+""" + +# ╔═╡ 251a1b0e-7efc-4ce2-b0ea-48a5c13d2c63 +md""" +The chain plot also shows the resulting **posterior distribution** of `α`. It is the prior distribution updated with the information contained in the data. +""" + +# ╔═╡ 6ff29c57-aca3-4ebe-a206-e733e81bcc20 +md""" +Taking the sampled values of the mutation rate from the chain and plotting a histogram will show us the exact same distribution as above. The one in the chain plot was simply smoothed to look continuous. +""" + +# ╔═╡ b657217a-6ccc-4a41-b852-df4e39a7a10a +sp_alpha = mutation_chain[:α]; + +# ╔═╡ 9cf08616-d599-42a3-82e8-e8c98853c1d8 +histogram(sp_alpha) # note the difference with the prior distribution! + +# ╔═╡ e02c42dc-627c-4bc0-8ebb-fdb2b0f15b64 +md"Plotting some sampled mutation rates from this distribution onto our data shows that they fit well:" + +# ╔═╡ 87e70d5a-7a45-4a3e-b6c4-a894cc78621b +begin + scatter(times, observed_mutations, xlabel = "Time (My)", + ylabel = "Number of mutations", label = false, xlims = (0, 500), + ylims = (0, 400), title = "A posteriori relationship between t and mean N" + ); + plot!([x -> αᵢ*x for αᵢ in sp_alpha[1:10:end]], color = :purple, opacity = 0.1, label = false) +end + +# ╔═╡ 87059440-2919-4f6d-9d32-0df3ce75e2a2 +md"And finally we can answer our first question:" + +# ╔═╡ 644cff58-68b9-4d4b-8896-617fcacc39c5 +mean(sp_alpha) + +# ╔═╡ 5f2f5a78-ca90-4baf-8bf0-ff1fae88d785 +sqrt(var(sp_alpha)) + +# ╔═╡ bfbf811f-7b2f-473b-b132-0c7e965c8b0d +md""" + $α$ is ± normally distributed around 0.75 with a standard deviation of 0.025. +""" + +# ╔═╡ d2d83d9a-0ed6-431b-8764-397c1bb019c2 +md"### Seahorses (extra)" + +# ╔═╡ 4e9dc370-6aca-40f7-807b-e85d412ab1a0 +md""" +To answer how old the ancestral seahorse fossil is, we need to update the model a little. +So far the fossil ages were considered to be known exactly and given as input to the model `ts`. Since the seahorse fossil's age is unknown, we add a parameter for it called `fossil_age`. 
+
+As prior knowledge we can use the fact that it must have evolved _after_ the ray-finned fish fossil (30 Ma after the ancestral bony fish), but _before_ modern seahorses (450 Ma after the bony fish fossil).
+"""
+
+# ╔═╡ fd15afe1-72d7-4663-b2b6-afa0dd219db8
+@model function horsetations(ts)
+    α ~ prior_alpha # prior distribution of parameter
+
+    N = zeros(length(ts)) # output variable: in this model we have multiple values, so we need to preallocate a vector
+    for i in eachindex(ts)
+        N_average = α * ts[i]
+        N[i] ~ Poisson(N_average)
+    end
+
+    fossil_age ~ Uniform(30, 450)
+    horse_mutations ~ Poisson(α * fossil_age)
+
+    return N
+end
+
+# ╔═╡ 43509549-f926-478b-a4da-995de443b3a7
+md"Then we simply repeat model instantiation, conditioning and sampling:"
+
+# ╔═╡ dbb8ad46-4ac1-443e-b39a-89ddf938ede2
+horse_model = horsetations(times)
+
+# ╔═╡ 8b380751-0e55-4691-bab0-247fa1b7c510
+horseditioned_model = horse_model | (N = observed_mutations, horse_mutations = 156);
+
+# ╔═╡ e4eda27b-cff2-4ef9-bbdc-75ce2b19b10b
+horse_chain = sample(horseditioned_model, NUTS(), 2000)
+
+# ╔═╡ da82b71b-fdf5-4fba-8ef3-f2aac67d3494
+md"And we have our posterior distribution of `fossil_age`! It seems the seahorse ancestor lived about 200-220 million years after the bony fish fossil, or about 240 million years ago."
+
+# ╔═╡ 7e5a39c7-cb66-4563-8f58-7245f88c9b85
+histogram(horse_chain[:fossil_age])
+
+# ╔═╡ 1e152790-a277-4dba-8785-d6a6120cc34f
+md"## The essentials"
+
+# ╔═╡ 0e044dc4-2fc4-4c41-9efa-e586fd138a69
+md"This section reiterates the essential code for this practical without much explanation." 
+ +# ╔═╡ e9d2f2ca-48ed-4cd8-bd0a-ff0ef640c8fb +# ╠═╡ disabled = true +#=╠═╡ +let + @model function mutations(ts) + α ~ Exponential(2) # prior distribution of parameter + + N = zeros(length(ts)) # output variable: in this model we have multiple values, so we need to preallocate a vector + for i in eachindex(ts) + N_average = α * ts[i] + N[i] ~ Poisson(N_average) + end + + return N + end + + mutation_model = mutations(times); # instantiate model + conditioned_model = mutation_model | (N = observed_mutations,) + # condition model + # mind the `,` after `observed_mutations`! + mutation_chain = sample(conditioned_model, NUTS(), 2000); + # run inference with an appropriate sampler + α_sp = mutation_chain[:α] # get sample of posterior distribution of α + histogram(α_sp, + title = "Posterior distribution of mutation rate α" + ) +end + ╠═╡ =# + +# ╔═╡ Cell order: +# ╟─eabed73e-19dc-4265-965a-cf762d630fb3 +# ╠═06bdc430-b965-11ef-36a4-3d863afbaf6e +# ╠═ce9c8c34-3690-4241-b021-c08868157a55 +# ╠═ea5d107c-f171-401f-8972-6a928ab72907 +# ╟─daed8bc0-8a85-45ce-84ac-0d13ff1923f1 +# ╟─6a79727c-5b5e-43cc-862f-7182c1ea878c +# ╟─83aa4c69-c027-4ade-ac44-438d784b2b78 +# ╟─3ecd3ac7-2621-4ea6-8ff1-69962769d934 +# ╟─71bbd593-f2e4-40dd-b682-40e65305ebb3 +# ╟─138538d4-f6b5-4b8e-af3a-273858cc463c +# ╠═f49c6d3f-5870-4473-b142-799fa84dbfb7 +# ╠═4ea82c1e-490f-4d88-b60c-5c2118409408 +# ╟─26128162-e355-4316-8d74-291fbca194a6 +# ╟─d3fdfa6f-1ee2-48d0-bb6c-483a6fb4faf9 +# ╟─d443d3e8-c1ee-467f-88b2-a0211d5b5cc0 +# ╟─5d83f336-372b-4493-baf6-8efffb663ff1 +# ╟─6666a388-1c45-4f6f-804f-eb3a260eae98 +# ╟─b0dbd481-ce36-4364-b995-b4fa8f36f76d +# ╟─641a1a77-6fcb-4a6c-8f8f-3b171704efa8 +# ╟─442aa57f-79d8-4b22-93b2-466ac92c2b13 +# ╟─12f017e3-b7b8-408d-a677-52dd2ea900eb +# ╟─7900fd13-cc69-41c5-972b-16d5b6d5452a +# ╟─430927e8-7fb8-494c-b9da-d46f000c142f +# ╟─656d3299-e5d0-4799-84ae-da32f51bd92e +# ╟─6ec75944-0b62-43a3-80e9-d6cd2ebd4478 +# ╟─0b5205c2-0eb4-4734-8470-055d29954f63 +# 
╠═2034e550-84df-42ee-9f0d-2bf417563dd4 +# ╟─3221f614-ef9a-4e09-a7da-b649fff0ab61 +# ╟─7a26cfec-b362-41c4-9b2e-1d040db8a5eb +# ╟─1080294d-8da9-4326-a53b-afe158dc2ab9 +# ╟─624a0a67-c5c1-437c-89e1-e08e83003b41 +# ╟─97257007-fbfa-4065-8788-869801ce3730 +# ╟─2a652734-bf47-405b-abb3-00e3d453bf47 +# ╟─c52854b2-b2b6-4437-b65b-a1f912d054d5 +# ╠═950c03a9-0056-4760-81f8-dd10d1e55cea +# ╟─31378eb3-51a5-4ad6-a713-7f77c7ceafcc +# ╠═48f6b7dc-13aa-4057-8468-97db047773ba +# ╟─d4dbc185-6087-4862-b6e5-b5e4f51c2866 +# ╠═8b0bf05f-92a0-4ce7-8042-08c790088688 +# ╠═3e4998e7-4981-4945-8bf2-ddb5afcb43b1 +# ╠═3421987d-1aab-4cab-bf83-dd3653715bce +# ╟─a9d54dfc-5337-412f-86b0-deeb5a0b6928 +# ╟─425b4b6c-76b8-4676-8d0a-fd26711400d6 +# ╟─b8adbdd4-2642-4375-9979-0cb8f52c5bc8 +# ╠═70f9e94d-a4e6-47d6-8d19-b60f7011d572 +# ╟─00c24cbb-88ba-49f2-9bf5-f44538fb2413 +# ╟─2d0c969d-03a2-4e4c-ace4-e439f81c771b +# ╠═a35a43e2-e6b0-47ce-80b2-48148336274c +# ╟─c5f0dbb3-fba1-41f2-b7d2-740012603555 +# ╠═d03cef36-3e82-4de4-89e7-af9f772edd8d +# ╟─371a48d5-daea-4d0b-968b-7e3056a65494 +# ╠═0d2c1359-434f-4f3d-8c04-c452c46d7ae8 +# ╟─1c00437c-e2f3-44f6-b020-ca213e321239 +# ╠═4c79adff-0640-4e69-815d-ab94ebd9c937 +# ╟─f620d591-7982-4b67-9524-45cfac27436b +# ╟─251a1b0e-7efc-4ce2-b0ea-48a5c13d2c63 +# ╟─6ff29c57-aca3-4ebe-a206-e733e81bcc20 +# ╠═b657217a-6ccc-4a41-b852-df4e39a7a10a +# ╠═9cf08616-d599-42a3-82e8-e8c98853c1d8 +# ╟─e02c42dc-627c-4bc0-8ebb-fdb2b0f15b64 +# ╟─87e70d5a-7a45-4a3e-b6c4-a894cc78621b +# ╟─87059440-2919-4f6d-9d32-0df3ce75e2a2 +# ╠═644cff58-68b9-4d4b-8896-617fcacc39c5 +# ╠═5f2f5a78-ca90-4baf-8bf0-ff1fae88d785 +# ╟─bfbf811f-7b2f-473b-b132-0c7e965c8b0d +# ╟─d2d83d9a-0ed6-431b-8764-397c1bb019c2 +# ╟─4e9dc370-6aca-40f7-807b-e85d412ab1a0 +# ╠═fd15afe1-72d7-4663-b2b6-afa0dd219db8 +# ╟─43509549-f926-478b-a4da-995de443b3a7 +# ╠═dbb8ad46-4ac1-443e-b39a-89ddf938ede2 +# ╠═8b380751-0e55-4691-bab0-247fa1b7c510 +# ╠═e4eda27b-cff2-4ef9-bbdc-75ce2b19b10b +# ╟─da82b71b-fdf5-4fba-8ef3-f2aac67d3494 +# 
╠═7e5a39c7-cb66-4563-8f58-7245f88c9b85 +# ╟─1e152790-a277-4dba-8785-d6a6120cc34f +# ╟─0e044dc4-2fc4-4c41-9efa-e586fd138a69 +# ╠═e9d2f2ca-48ed-4cd8-bd0a-ff0ef640c8fb diff --git a/src/exercises/MCMC_2-basics.jl b/src/exercises/MCMC_2-basics.jl new file mode 100644 index 0000000..5775bc7 --- /dev/null +++ b/src/exercises/MCMC_2-basics.jl @@ -0,0 +1,611 @@ +### A Pluto.jl notebook ### +# v0.20.4 + +#> [frontmatter] +#> order = "21" +#> title = "4. MCMC basics" +#> date = "2025-03-24" +#> tags = ["exercises"] +#> description = "Basic inference exercises" +#> layout = "layout.jlhtml" +#> +#> [[frontmatter.author]] +#> name = "Bram Spanoghe" + +using Markdown +using InteractiveUtils + +# ╔═╡ 94c6f31d-1a43-4221-b60c-1fa0ef8738b8 +using Pkg; Pkg.activate("../../pluto-deployment-environment") + +# ╔═╡ 45bc5b66-c81b-4afb-8a7e-51aff9609c62 +using Turing, StatsPlots + +# ╔═╡ 9daad9dc-b092-4845-a276-ed3a24249924 +using PlutoUI; TableOfContents() + +# ╔═╡ 41dc8060-cf5e-11ef-26f9-892577e77af0 +md"# Inference notebook #2: Basics" + +# ╔═╡ 299ba93b-0fc0-4bb3-9a2c-a571ce571f1b +md"## 1: Mole burrow" + +# ╔═╡ 0d068aa6-f7c9-4a75-8382-cfbc903efc5b +md""" +Consider a mole's underground tunnel network of length `X` (in m). Now and then the mole makes a new molehill somewhere randomly above its tunnel, the locations of which we denote `Y`. +""" + +# ╔═╡ 5d4e269d-bd5a-422e-9547-099e04b8eedf +md""" + Y1 Y2 Y3 + /\ /\ /\ <- molehills + ================================================================= <- ground + | | | + ------------------------------------------------ <- tunnel + 0 X +""" + +# ╔═╡ 7abeb4d3-4648-4d8b-9383-de5213c8cc41 +md""" +We're no mole experts, but for our prior information we can suppose a mole would not make a tunnel much longer than a few 100 m (probably a lot shorter). + +We can formulate this as `X ~ Exponential(100)` and `Y ~ Uniform(0, X)`. + +!!! questions + 1. Plot the prior of `X`. Is it diffuse or informative? + 1. Estimate `E[Y]`. + 1. 
Estimate `E[X|Y = 3]` and compare it with the prior expected value `E[X]`. + 1. Plot the histogram of `X` given `Y = 3.0`. + 1. You come back a day later and find even more molehills! You now measure the following values for `Y`: `[3.0, 1.5, 0.9, 5.7]`. How long do you estimate the tunnel given this extra information? +""" + +# ╔═╡ 47aa2304-d312-40e9-b9c6-7c79a7d64de4 +md"### 1: Prior plot" + +# ╔═╡ 78c9ce62-e375-48ac-8083-55ee085c61de +X_prior = missing + +# ╔═╡ c64b346a-059d-4bbe-b166-80116260e19b +missing # plot above + +# ╔═╡ 78eb7779-f182-4419-b5d8-79a2f5c5d6da +md"### 2: Unconditional expected value of Y" + +# ╔═╡ 3decb2ec-210a-4b2f-842d-6fd40dd3f77b +@model function mole() + X ~ X_prior + Y ~ missing + return Y +end + +# ╔═╡ 76dd814d-0d9b-4f7e-aff8-990da57d052b +molemodel = mole() + +# ╔═╡ 61b8ef63-a613-4dcb-8f7a-936e0d59b862 +spY = missing + +# ╔═╡ 4e8b9000-cb61-4f01-9ba9-17276ad0335e +E_Y = missing + +# ╔═╡ 8777b133-7d7c-4a85-b89c-2f00093e9984 +md"### 3: Conditional expected value of X" + +# ╔═╡ 9695ee7e-359b-489c-962b-1bf84b052371 +cond_mole = missing + +# ╔═╡ 014538da-b5ef-41c8-b799-2c000b4c9134 +molechain = missing + +# ╔═╡ 977d0406-f154-4b87-b9c6-cafe4809a4a6 +missing # plot molechain + +# ╔═╡ af4811ce-c6e7-458a-8f89-0562355b3eb0 +spXcondY = missing + +# ╔═╡ fd540765-95e5-4071-81f7-e689b06cad0c +E_XcondY = missing + +# ╔═╡ 5521933a-a42e-4a67-94b9-84eab52ddf07 +E_X = missing + +# ╔═╡ 9403c355-dce8-4aa9-80a3-824ea6193905 +md"### 4: Conditional distribution of X" + +# ╔═╡ bedd99ca-4285-4c3d-87e2-c22d414f8f07 +# histogram + +# ╔═╡ 21743f62-c7f5-4171-8277-0667edb71c56 +md"### 5: Conditional distribution of X (with more data)" + +# ╔═╡ b2b16c34-e12a-4bea-8098-313d01913bbf +@model function mole2() + X ~ X_prior + Ys = missing + for i in missing + Ys[i] ~ missing + end +end + +# ╔═╡ 26a245d2-c5b6-484d-bdf7-541948ff06ad +Y_obs = [3.0, 1.5, 0.9, 5.7] + +# ╔═╡ dc6aadff-7cff-4a9a-a967-32cd322b2696 +missing # (some lines of code) + +# ╔═╡ 
ea2235f0-0cfc-4603-b027-d0a1bc6fa2c1 +# I estimate the tunnel to be about (MISSING) long + +# ╔═╡ 951d0913-1a52-4d5b-b5bb-168487e50ab2 +md"## 2: Potatoes" + +# ╔═╡ 49035a16-c419-4531-b157-a5ab357b44fe +md""" +Consider a number of potatoes `N` each with an average weight `W`. You weigh them together on an old balance to get an estimate of their total weight `T`. + +Suppose the following priors: `N ~ Poisson(10)`, `W ~ Uniform(150, 250)` and the following relationship between expected - and actual outcome: `T ~ Normal(N*W, 50)`. + +!!! questions + 1. Plot a histogram of `N` given `T = 1200`. + 1. Estimate `P(N > 6, W > 175 | T = 1200)`. + 1. Estimate `P(N = 5 | T = 1200, W = 220)`. +""" + +# ╔═╡ 5f8322ca-0ded-4750-8ad6-e66e8280daca +md"### 1: Conditional distribution of N" + +# ╔═╡ ec479be9-0d8a-4c8d-97c9-6f64d861924c +@model function potatoes() + N ~ missing + W ~ missing + T ~ missing +end + +# ╔═╡ 438893d1-3107-493c-9299-d8813658a698 +potato_model = missing + +# ╔═╡ dd7d84b0-1f20-4478-96ba-b3f4e9df1d2c +potato_cond = missing + +# ╔═╡ accca8d7-8822-49ae-8588-7259da82e37a +potato_chain = missing + # be careful with your choice of sampling algorithm! are all priors continuous? 
+ +# ╔═╡ 519a952b-f040-4588-8acd-1cfaab88c4da +# plot chain + +# ╔═╡ 57adc714-cdd1-46cf-a4ea-d0fb04924e1e +# histogram of N + +# ╔═╡ 0fddec58-67d9-4dec-a72d-445675fa46a6 +md"### 2: Probability" + +# ╔═╡ b77d8796-9065-4c61-be21-3dc292b94492 +p_potato1 = missing + +# ╔═╡ 5bde987a-3f5c-4ff7-96a0-9e174f73fdfb +md"### 3: Probability (with more data)" + +# ╔═╡ abffd02e-b917-4e9e-a94d-de197e1b0fc3 +potato_cond2 = missing # alternative conditioned model now also given `W = 220` + +# ╔═╡ 5b30d91a-7039-4d2a-bcf7-3a990622fd2a +pota2_chain = missing + +# ╔═╡ 27939d84-d6e1-47b2-a939-5038c6bdcb48 +# plot chain + +# ╔═╡ 744cf31f-89c0-44e8-a6fb-0f4dcea04e5d +p_potato2 = missing + +# ╔═╡ ed7c5547-bd82-4ca7-bf51-dd1c201d88af +md"## 3: Lights out" + +# ╔═╡ a3c26839-2acc-49d8-98c4-3a7142dd6512 +md""" +You use 4 of the same LED light in your room. Define: +- $μ$ the **average** lifespan of your LED lights (in khr or 1000 hours) +- $Lᵢ$ the lifespan of the `i`-th LED light. + +Assume that `μ ~ LogNormal(log(40), 0.5)`. +""" + +# ╔═╡ dac55632-6e17-479e-8090-5cf8eaa67dad +md""" +!!! questions + 1. What is `E[μ]` (given no information about `Lᵢ`)? + 1. What is a sensible distribution for `Lᵢ`? (requires no code) + 1. What is `E[μ | L = [16, 20, 23, 41]]`? + 1. 🌟🌟🌟 (EXTRA DIFFICULT BONUS QUESTION): After 30 khr, two lights have died: one at 16 khr and one at 20 khr. The two other lights are still working. What is the expected value of `μ` given this information? +""" + +# ╔═╡ 7df97000-5cf0-4a29-b3c3-f867403f4318 +md"### 1: Unconditional expected value" + +# ╔═╡ 0fe475dd-411e-474e-9524-ef9fcceed7af +lights_prior = missing + +# ╔═╡ db77b90b-33b5-4d70-8229-b4c7d74fa96c +E_mu = missing + +# ╔═╡ 308cb1f5-e26b-43b2-be4d-9eabd68b3670 +md"### 2: Choice of distribution" + +# ╔═╡ 738b4097-56e1-4921-9195-2d535c78ae73 +md""" +!!! note + See theory p. 96 for an overview of elementary distributions. 
+""" + +# ╔═╡ d797cb9b-430b-4d1d-8ded-21959a7cb3a9 +# A sensible distribution for the lifespan of a LED-light is (missing) + +# ╔═╡ cd37e5fc-ae9a-429d-844d-4b450b187b5e +md"### 3: Conditional expected value" + +# ╔═╡ 126d3954-c77d-4c98-abe0-fd87d14e6265 +@model function lights() + μ ~ lights_prior + lifespans = missing + for i in missing + lifespans[i] ~ missing + end +end + +# ╔═╡ 37bf6ca5-e42b-406e-94d6-bcd1e706e2bd +L_obs = [16, 20, 23, 41] + +# ╔═╡ c49955c7-7082-4261-b4db-0056b5c637f1 +E_mu_cond = missing + +# ╔═╡ 9a168680-9d96-466f-ba75-122d6a391501 +md"### 4 🌟🌟🌟: Working with censored data" + +# ╔═╡ ab6331a3-676e-45e6-b5be-e9c4154ea071 +md""" +!!! note + This question is way above the exam's difficulty level. Don't feel bad if you can't find the answer right away! +""" + +# ╔═╡ 1f35d962-a249-4be7-9a96-17eb83fca7d8 +md""" +!!! hint + You can model the number of lights that still work as a `Binomial` distribution, the success rate of which depends on `μ`. +""" + +# ╔═╡ 8fc58fa4-b005-4f32-9eae-a8143582a1ae +@model function lights_censored(time_observed) + missing +end + +# ╔═╡ 410521d8-f767-4c4e-b19b-25cf31ec0f36 +E_mu_cond🌟 = missing + +# ╔═╡ 9dc0456b-7fd2-4120-8f9e-3de1984ff516 +md"## 4: Fish" + +# ╔═╡ 225cd579-1e0d-4680-8d3f-5a737a656eb8 +md""" +There are two populations of fish living in the same pond. Let `fs1` be the fraction of fish belonging to species 1, `L1` the length of a fish of species 1 and `L2` the length of a fish of species 2. + +Assume: +- You have no prior information about `fs1` except that it logically needs to be in `[0, 1]`. +- `L1 ~ Normal(90, 15)`. +- `L2 ~ Normal(60, 10)`. +""" + +# ╔═╡ 3ffac5ca-3635-4aa1-bab7-7c28e7a801cb +md""" +!!! questions + 1. If `fs1 = 0.3`, what is the prior distribution of the lengths of **all** fish in the pond? Make a plot. + 1. Estimate `fs1` if you observe fish of the following lengths: + `[94.0, 88.7, 89.6, 69.8, 52.8, 84.0, 89.3, 66.4, 95.1, 81.6]`. + 1. 
🌟(BONUS QUESTION): What is the chance that fish 4 belongs to species 1?
+"""
+
+# ╔═╡ b477b212-83a2-42f0-a616-52516e152d48
+md"### 1: Prior distribution given `fs1`"
+
+# ╔═╡ 23056f1e-128e-463e-a80f-56299397022e
+md"""
+!!! hint
+    The distribution of fish lengths can be modelled as a `MixtureModel`.
+"""
+
+# ╔═╡ 3cd0c888-9f11-47f1-a293-b96aa80ea3b0
+lengthdist = missing
+
+# ╔═╡ e23756e0-7688-4e3b-ba8d-a827e621e862
+missing # plot
+
+# ╔═╡ 8e030a06-5104-4c5b-b1f2-f86464e66502
+md"### 2: Conditional expected value"
+
+# ╔═╡ 6ebf3a16-0e6a-491f-8280-b4327ed52cf0
+len_obs = [94.0, 88.7, 89.6, 69.8, 52.8, 84.0, 89.3, 66.4, 95.1, 81.6]
+
+# ╔═╡ b15de91a-fc48-4cdc-a35f-6453a9a59982
+@model function fishmixture()
+    fs1 ~ missing # fraction of species 1
+    fishlendist = missing
+
+    fishlens = missing
+    for i in missing
+        fishlens[i] ~ fishlendist
+    end
+end
+
+# ╔═╡ 94a9e163-f7c0-4d22-8d37-5641bc49eb6b
+fishmodel = missing
+
+# ╔═╡ 40d460d2-1f2b-4d33-ba74-9a9d1e48f9ec
+fishchain = missing # don't forget to plot to check convergence
+
+# ╔═╡ 819ed364-4bc9-43ec-aa50-b965c8f1c826
+fs1_est = missing
+
+# ╔═╡ 952a941a-8703-45a3-aac1-a290a181e8c5
+md"### 3🌟: Conditional expected value (spicy)"
+
+# ╔═╡ 8629d049-b9fd-4e9e-9b55-401a3069e956
+@model function fishmixture🌟()
+    fs1 ~ missing # fraction of species 1
+
+    fishlens = missing
+    isspecies1 = missing
+    for i in missing
+        isspecies1[i] ~ missing
+        if isspecies1[i] == 1.0
+            fishlens[i] ~ missing
+        else
+            fishlens[i] ~ missing
+        end
+    end
+end
+
+# ╔═╡ 483cbe4d-64d3-4b16-b6fc-e97b21f174a6
+p_fish4_is_species1 = missing
+
+# ╔═╡ 30088664-5157-4d99-8584-7a42d0acdfb8
+md"## 5: Circleference"
+
+# ╔═╡ 7dd2c189-79c0-4d29-9e17-9c24a78b5791
+md"""
+Given three (noisy) points $P_1=(x_1,y_1)$, $P_2=(x_2,y_2)$ and $P_3=(x_3,y_3)$, you want to infer the corresponding circle.
+
+You can assume that the circle center can appear anywhere in the $[-20, 20]\times [-20, 20]$ square and the radius is between 0 and 50. 
Points are sampled randomly on the circle and have a slight amount of Gaussian noise ($\sigma=0.25$ works well). +""" + +# ╔═╡ dcbf405c-786c-4226-b35c-dc718452bb61 +md""" +!!! questions + 1. Write a small probabilistic program that can infer the center and radius of the circle. + 1. What does the inferred circle look like if you condition on only one or two of the circle points? +""" + +# ╔═╡ 791ff4ed-c9d1-48e2-9dd0-4bf3979c6167 +x1, y1 = 18.0, 2.1 + +# ╔═╡ cb189957-f9d4-480b-a492-92cfc2a8c2aa +x2, y2 = -7.3, 8.1 + +# ╔═╡ 81eca3b3-12d9-43d7-af14-e9aeb73f2471 +x3, y3 = -13.0, -23.0 + +# ╔═╡ 28d28034-e999-4e34-b6b2-63c762094c59 +begin + + function plotcircle!(p, R, xC, yC; dθ=0.01) + θ = 0:dθ:2pi+0.1 + plot!(p, xC .+ R .* cos.(θ), yC .+ R .* sin.(θ), label="", alpha=0.5, color=:blue) + return p + end + + function plotsample(R=missing, xC=missing, yC=missing; kwargs...) + p = plot(xlab="x", ylab="y", aspect_ratio=:equal; + xlims=[-40, 40], ylims=[-40, 40], kwargs...) + + scatter!([x1], [y1], label="P1") + scatter!([x2], [y2], label="P2") + scatter!([x3], [y3], label="P3") + ismissing(R) || plotcircle!(p, R, xC, yC; dθ=0.1) + return p + end + + function plotsample!(p, R=missing, xC=missing, yC=missing) + scatter!([x1], [y1], label=false) + scatter!([x2], [y2], label=false) + scatter!([x3], [y3], label=false) + ismissing(R) || plotcircle!(p, R, xC, yC; dθ=0.1) + end + +end + +# ╔═╡ 4e96908a-4fc9-429d-bf37-7a569194a038 +scatter([x1, x2, x3], [y1, y2, y3]) + +# ╔═╡ 7eedb74d-eee1-4cf0-b2bf-5febf474edd2 +md"### 1: All points" + +# ╔═╡ 84d27c98-9513-4ae3-8101-621c083a1b01 +@model function circle(σ=0.25) + # generate a circle center + xC ~ missing + yC ~ missing + + # generate a radius + R ~ missing + + # three random points in polar coordinates + θ1 ~ missing + θ2 ~ missing + θ3 ~ missing + + # P1 + x1 ~ Normal(missing, σ) + y1 ~ Normal(missing, σ) + # P2 + x2 ~ Normal(missing, σ) + y2 ~ Normal(missing, σ) + # P3 + x3 ~ Normal(missing, σ) + y3 ~ Normal(missing, σ) +end + +# 
╔═╡ 543c40d5-e8a7-492d-a0b8-e7e73e5953e2 +circlemodel = circle() | (x1=x1, y1=y1, x2=x2, y2=y2, x3=x3, y3=y3); + +# ╔═╡ 25ea29e7-b391-4bd1-bdbc-1957adb8c993 +circlechain = missing + +# ╔═╡ 0d04b0e8-3d1a-4281-b175-570148569ef2 +begin + p = plot( + xlab="x", ylab="y", aspect_ratio=:equal, xlims=[-40, 40], ylims=[-40, 40], + title="Samples of P(circle|P1,P2,P3)" + ) + for i in 1:100 + plotsample!(p, circlechain[:R][i], circlechain[:xC][i], circlechain[:yC][i]) + end + p +end + +# ╔═╡ 0a140d2a-a24b-48df-9af8-7fa5d586a26f +md"### 2: Some points" + +# ╔═╡ b7649523-fd40-4fe3-8d86-fc2cb2c8c488 +circle1 = missing # given one point + +# ╔═╡ b7ba8f91-f440-4166-af9e-11fcb5f1755f +circle2 = missing # given two points + +# ╔═╡ 3b5f3623-3d00-4ba4-9182-5bddacc56567 +chain1 = missing + +# ╔═╡ c11452bc-f404-4e51-8f68-292f5b538c88 +chain2 = missing + +# ╔═╡ c1174a5c-a6a1-46a0-96be-6997e3201dfc +begin + p1 = plot( + xlab="x", ylab="y", aspect_ratio=:equal, + xlims=[-40, 40], ylims=[-40, 40], title="Samples of P(circle|P1)" + ) + for i in 1:100 + plotsample!(p1, chain1[:R][i], chain1[:xC][i], chain1[:yC][i]) + end + p1 +end + +# ╔═╡ 793fc102-fc64-4041-a04e-e0b1b0741437 +begin + p2 = plot( + xlab="x", ylab="y", aspect_ratio=:equal, + xlims=[-40, 40], ylims=[-40, 40], title="Samples of P(circle|P1,P2)" + ) + for i in 1:100 + plotsample!(p2, chain2[:R][i], chain2[:xC][i], chain2[:yC][i]) + end + p2 +end + +# ╔═╡ Cell order: +# ╟─41dc8060-cf5e-11ef-26f9-892577e77af0 +# ╠═94c6f31d-1a43-4221-b60c-1fa0ef8738b8 +# ╠═45bc5b66-c81b-4afb-8a7e-51aff9609c62 +# ╠═9daad9dc-b092-4845-a276-ed3a24249924 +# ╟─299ba93b-0fc0-4bb3-9a2c-a571ce571f1b +# ╟─0d068aa6-f7c9-4a75-8382-cfbc903efc5b +# ╟─5d4e269d-bd5a-422e-9547-099e04b8eedf +# ╟─7abeb4d3-4648-4d8b-9383-de5213c8cc41 +# ╟─47aa2304-d312-40e9-b9c6-7c79a7d64de4 +# ╠═78c9ce62-e375-48ac-8083-55ee085c61de +# ╠═c64b346a-059d-4bbe-b166-80116260e19b +# ╟─78eb7779-f182-4419-b5d8-79a2f5c5d6da +# ╠═3decb2ec-210a-4b2f-842d-6fd40dd3f77b +# 
╠═76dd814d-0d9b-4f7e-aff8-990da57d052b +# ╠═61b8ef63-a613-4dcb-8f7a-936e0d59b862 +# ╠═4e8b9000-cb61-4f01-9ba9-17276ad0335e +# ╟─8777b133-7d7c-4a85-b89c-2f00093e9984 +# ╠═9695ee7e-359b-489c-962b-1bf84b052371 +# ╠═014538da-b5ef-41c8-b799-2c000b4c9134 +# ╠═977d0406-f154-4b87-b9c6-cafe4809a4a6 +# ╠═af4811ce-c6e7-458a-8f89-0562355b3eb0 +# ╠═fd540765-95e5-4071-81f7-e689b06cad0c +# ╠═5521933a-a42e-4a67-94b9-84eab52ddf07 +# ╟─9403c355-dce8-4aa9-80a3-824ea6193905 +# ╠═bedd99ca-4285-4c3d-87e2-c22d414f8f07 +# ╟─21743f62-c7f5-4171-8277-0667edb71c56 +# ╠═b2b16c34-e12a-4bea-8098-313d01913bbf +# ╠═26a245d2-c5b6-484d-bdf7-541948ff06ad +# ╠═dc6aadff-7cff-4a9a-a967-32cd322b2696 +# ╠═ea2235f0-0cfc-4603-b027-d0a1bc6fa2c1 +# ╟─951d0913-1a52-4d5b-b5bb-168487e50ab2 +# ╟─49035a16-c419-4531-b157-a5ab357b44fe +# ╟─5f8322ca-0ded-4750-8ad6-e66e8280daca +# ╠═ec479be9-0d8a-4c8d-97c9-6f64d861924c +# ╠═438893d1-3107-493c-9299-d8813658a698 +# ╠═dd7d84b0-1f20-4478-96ba-b3f4e9df1d2c +# ╠═accca8d7-8822-49ae-8588-7259da82e37a +# ╠═519a952b-f040-4588-8acd-1cfaab88c4da +# ╠═57adc714-cdd1-46cf-a4ea-d0fb04924e1e +# ╟─0fddec58-67d9-4dec-a72d-445675fa46a6 +# ╠═b77d8796-9065-4c61-be21-3dc292b94492 +# ╟─5bde987a-3f5c-4ff7-96a0-9e174f73fdfb +# ╠═abffd02e-b917-4e9e-a94d-de197e1b0fc3 +# ╠═5b30d91a-7039-4d2a-bcf7-3a990622fd2a +# ╠═27939d84-d6e1-47b2-a939-5038c6bdcb48 +# ╠═744cf31f-89c0-44e8-a6fb-0f4dcea04e5d +# ╟─ed7c5547-bd82-4ca7-bf51-dd1c201d88af +# ╟─a3c26839-2acc-49d8-98c4-3a7142dd6512 +# ╟─dac55632-6e17-479e-8090-5cf8eaa67dad +# ╟─7df97000-5cf0-4a29-b3c3-f867403f4318 +# ╠═0fe475dd-411e-474e-9524-ef9fcceed7af +# ╠═db77b90b-33b5-4d70-8229-b4c7d74fa96c +# ╟─308cb1f5-e26b-43b2-be4d-9eabd68b3670 +# ╟─738b4097-56e1-4921-9195-2d535c78ae73 +# ╠═d797cb9b-430b-4d1d-8ded-21959a7cb3a9 +# ╟─cd37e5fc-ae9a-429d-844d-4b450b187b5e +# ╠═126d3954-c77d-4c98-abe0-fd87d14e6265 +# ╠═37bf6ca5-e42b-406e-94d6-bcd1e706e2bd +# ╠═c49955c7-7082-4261-b4db-0056b5c637f1 +# ╟─9a168680-9d96-466f-ba75-122d6a391501 +# 
╟─ab6331a3-676e-45e6-b5be-e9c4154ea071 +# ╟─1f35d962-a249-4be7-9a96-17eb83fca7d8 +# ╠═8fc58fa4-b005-4f32-9eae-a8143582a1ae +# ╠═410521d8-f767-4c4e-b19b-25cf31ec0f36 +# ╟─9dc0456b-7fd2-4120-8f9e-3de1984ff516 +# ╟─225cd579-1e0d-4680-8d3f-5a737a656eb8 +# ╟─3ffac5ca-3635-4aa1-bab7-7c28e7a801cb +# ╟─b477b212-83a2-42f0-a616-52516e152d48 +# ╟─23056f1e-128e-463e-a80f-56299397022e +# ╠═3cd0c888-9f11-47f1-a293-b96aa80ea3b0 +# ╠═e23756e0-7688-4e3b-ba8d-a827e621e862 +# ╟─8e030a06-5104-4c5b-b1f2-f86464e66502 +# ╠═6ebf3a16-0e6a-491f-8280-b4327ed52cf0 +# ╠═b15de91a-fc48-4cdc-a35f-6453a9a59982 +# ╠═94a9e163-f7c0-4d22-8d37-5641bc49eb6b +# ╠═40d460d2-1f2b-4d33-ba74-9a9d1e48f9ec +# ╠═819ed364-4bc9-43ec-aa50-b965c8f1c826 +# ╟─952a941a-8703-45a3-aac1-a290a181e8c5 +# ╠═8629d049-b9fd-4e9e-9b55-401a3069e956 +# ╠═483cbe4d-64d3-4b16-b6fc-e97b21f174a6 +# ╟─30088664-5157-4d99-8584-7a42d0acdfb8 +# ╟─7dd2c189-79c0-4d29-9e17-9c24a78b5791 +# ╟─dcbf405c-786c-4226-b35c-dc718452bb61 +# ╟─28d28034-e999-4e34-b6b2-63c762094c59 +# ╠═791ff4ed-c9d1-48e2-9dd0-4bf3979c6167 +# ╠═cb189957-f9d4-480b-a492-92cfc2a8c2aa +# ╠═81eca3b3-12d9-43d7-af14-e9aeb73f2471 +# ╠═4e96908a-4fc9-429d-bf37-7a569194a038 +# ╟─7eedb74d-eee1-4cf0-b2bf-5febf474edd2 +# ╠═84d27c98-9513-4ae3-8101-621c083a1b01 +# ╠═543c40d5-e8a7-492d-a0b8-e7e73e5953e2 +# ╠═25ea29e7-b391-4bd1-bdbc-1957adb8c993 +# ╠═0d04b0e8-3d1a-4281-b175-570148569ef2 +# ╟─0a140d2a-a24b-48df-9af8-7fa5d586a26f +# ╠═b7649523-fd40-4fe3-8d86-fc2cb2c8c488 +# ╠═b7ba8f91-f440-4166-af9e-11fcb5f1755f +# ╠═3b5f3623-3d00-4ba4-9182-5bddacc56567 +# ╠═c11452bc-f404-4e51-8f68-292f5b538c88 +# ╠═c1174a5c-a6a1-46a0-96be-6997e3201dfc +# ╠═793fc102-fc64-4041-a04e-e0b1b0741437 diff --git a/src/exercises/MCMC_3-advanced.jl b/src/exercises/MCMC_3-advanced.jl new file mode 100644 index 0000000..58b7f3e --- /dev/null +++ b/src/exercises/MCMC_3-advanced.jl @@ -0,0 +1,231 @@ +### A Pluto.jl notebook ### +# v0.20.4 + +#> [frontmatter] +#> order = "22" +#> title = "4. 
MCMC advanced" +#> date = "2025-03-24" +#> tags = ["exercises"] +#> description = "Advanced inference exercises" +#> layout = "layout.jlhtml" +#> +#> [[frontmatter.author]] +#> name = "Bram Spanoghe" + +using Markdown +using InteractiveUtils + +# This Pluto notebook uses @bind for interactivity. When running this notebook outside of Pluto, the following 'mock version' of @bind gives bound variables a default value (instead of an error). +macro bind(def, element) + #! format: off + quote + local iv = try Base.loaded_modules[Base.PkgId(Base.UUID("6e696c72-6542-2067-7265-42206c756150"), "AbstractPlutoDingetjes")].Bonds.initial_value catch; b -> missing; end + local el = $(esc(element)) + global $(esc(def)) = Core.applicable(Base.get, el) ? Base.get(el) : iv(el) + el + end + #! format: on +end + +# ╔═╡ 75581580-2fb2-4112-b397-2b775eb64630 +using Pkg; Pkg.activate("../../pluto-deployment-environment") + +# ╔═╡ e07a1ae5-43b7-4c12-831d-43e1738eeac0 +using Turing, StatsPlots + +# ╔═╡ d12cdd62-f90a-4ba8-8610-5e86e922e881 +using PlutoUI + +# ╔═╡ f84d9259-69c0-4165-8bc0-d924fef18182 +md"# Inference notebook #3: Advanced" + +# ╔═╡ aa2b6263-7d13-4683-bc92-25663ed02604 +md"## GPS" + +# ╔═╡ f3510387-1ab6-4aa2-bed9-9c8297a8b3c5 +md""" +GPS systems need to decide what road a car is following based on noisy positional data. We consider here a simplified example. + +At some known timepoints `ts`, we get noisy observations of the car's vertical position `ys_obs` (imagine it as a latitude of sorts). There are two parallel roads (lines) the car can actually be on, which both have a constant vertical position. If the car is on road 1, then `y = 0`. If it is on road 2, then `y = 1`. The problem is visualized below. 
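The observed positions can be thought of as coming from a simple generative process: a constant road height plus Gaussian noise. A minimal plain-Julia sketch (`simulate_positions` is an illustrative helper, not part of the exercise):

```julia
# Simulate noisy vertical positions for a car that switches road at t_switch.
# Before the switch the car is on road 1 (y = 0); after it, road 2 (y = 1).
function simulate_positions(ts, t_switch, σ)
    [(t <= t_switch ? 0.0 : 1.0) + σ * randn() for t in ts]
end

# Ten observations with a switch after t = 6 and a bit of noise:
ys = simulate_positions(1:10, 6.0, 0.5)
```

With `σ = 0` the simulated trajectory is exactly the step function the plot below draws.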
+""" + +# ╔═╡ 90f185f3-91d4-4ee7-8de3-b76251c0c169 +ts = 1:10; + +# ╔═╡ 599ae818-fb28-4803-b720-9fcf2bb162b7 +ys_obs = [0.6, 0.0, 0.8, -0.7, -0.5, 0.2, 1.0, 1.2, 1.8, 1.1]; + +# ╔═╡ b2da9dc3-006c-49d0-81a7-6375a2dc6872 +begin + p_cardata = scatter(ts, ys_obs, label = "Observed car positions", + xlabel = "Time", ylabel = "Vertical position" + ) + hline!([0.0], color = :orange, label = "Road 1", linewidth = 2) + hline!([1.0], color = :blue, label = "Road 2", linewidth = 2) +end + +# ╔═╡ 46d3214c-e724-4b98-bc1f-5b9913d2b14a +md""" +At some point `t_switch` ∈ [0, 10], the car switches from road 1 to road 2. We can describe the model as follows: +- If `t <= t_switch`, then `y ~ Normal(0.0, σ)`, +- If `t > t_switch`, then `y ~ Normal(1.0, σ)`, +where `σ` is a noise parameter that you only know is probably small. +""" + +# ╔═╡ 0bdac14c-db71-47b5-95d5-d83fb1e68cab +md""" +Below is a plot showing the car's trajectory for some value of `t_switch`. You can adjust the slider to change this guessed value. +""" + +# ╔═╡ 9e3db8b6-5466-48b6-9ca9-0d24515b9de3 +@bind switchtime Slider(0:0.1:10, default = 5.0, show_value = true) + +# ╔═╡ cc0d6b32-6f12-46b8-915f-8a471317c35e +begin + plot!( + deepcopy(p_cardata), [0.0, switchtime, switchtime, 10.0], + [0.0, 0.0, 1.0, 1.0], label = "Car trajectory", color = :black, + linewidth = 2, xticks = ([0, 10, switchtime], ["0", "10", "t_switch"]) + ) +end + +# ╔═╡ 96fc1412-83fc-4f67-8995-065b163d739c +md""" +!!! question + Infer the posterior distribution of `t_switch` given the data. 
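As a sanity check on whatever model you build, the likelihood can also be evaluated by brute force on a grid of `t_switch` values. A minimal plain-Julia sketch (assuming a fixed σ = 0.5 purely for illustration; the exercise itself asks for the full posterior):

```julia
ts = 1:10
ys_obs = [0.6, 0.0, 0.8, -0.7, -0.5, 0.2, 1.0, 1.2, 1.8, 1.1]
σ = 0.5  # assumed known here, only to keep the sketch one-dimensional

# Gaussian log-likelihood of the data (up to an additive constant):
# each point has mean 0 before the switch and mean 1 after it.
loglik(t_switch) = sum(-(y - (t <= t_switch ? 0.0 : 1.0))^2 / (2σ^2)
                       for (t, y) in zip(ts, ys_obs))

grid = 0:0.5:10
best = grid[argmax(loglik.(grid))]  # maximum-likelihood switch time
```

The grid maximum lands between the sixth and seventh observation, which matches what the scatter plot suggests by eye.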
+""" + +# ╔═╡ 70fd793a-3ce3-446f-adc1-65ce0a68e48a +@model function cars(ts) + t_switch ~ missing + σ ~ missing + + for pointidx in 1:length(ts) + missing + end +end + +# ╔═╡ fb202b86-c058-4f03-9062-ab282c71d5c4 +missing # histogram of `t_switch` + +# ╔═╡ 34bf815a-3bd3-49b5-b06d-d4297ca213a8 +md"## Petridish peril (inference edition)" + +# ╔═╡ 78571fb5-c44f-4e7f-afde-4436db6c945b +md""" +We continue with the "petridish peril" question from the previous practical. + +You've made a model to predict bacterial population levels at certain timepoints based on your knowledge of how the species in question grows. You'd now like to update the model with information about the specific strain you're using, so you inoculate a petri dish and count the number of bacteria after a short incubation period. + +Incorporate the following information into the model to make it more accurate: +- The population level after 5 hours of incubation was 21000. +- You expect the number of bacteria you count to be Poisson distributed around the actual number. +""" + +# ╔═╡ 2a84472c-cb6f-4607-9b98-c88cc2744e3d +md""" +!!! questions + 1. Taking the measurement into account, what is the probability that your petri dish is in a splittable state after 8 hours? + 1. Visualise the updated growth curves. + 1. 🌟(BONUS): Because the prior for P0 is discrete, it doesn't allow for the use of a continuous sampler. Replace it with a sufficiently similar continuous prior to fix this. How does this affect the results? +""" + +# ╔═╡ 34120240-6fac-42b8-b5c8-bc3271927248 +md""" +!!! tip + Just like in the previous version of the question, `return`ing the estimated logistic function can be useful. 
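As a quick reminder of how the growth curve behaves, here is a sanity-check sketch (plain Julia, mirroring the `logistic` helper defined in the next cell; the parameter values are made up for illustration):

```julia
# Logistic growth: population level at time t, starting from P0,
# with growth rate r and carrying capacity K.
logistic(t, P0, r, K) = K / (1 + (K - P0)/P0 * exp(-r*t))

P0, r, K = 20.0, 1.0, 1e5
logistic(0.0, P0, r, K)   # equals P0: the curve starts at the inoculum size
logistic(50.0, P0, r, K)  # approaches K: growth saturates at carrying capacity
```

Returning this function from the model lets you evaluate a sampled growth curve at any timepoint (such as t = 8 h) after sampling.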
+""" + +# ╔═╡ 4b94e5a6-1ed6-4095-ab18-06b76d4fec99 +logistic(t, P0, r, K) = K / (1 + (K - P0)/P0 * exp(-r*t)) + +# ╔═╡ 234b2c87-fe91-4be4-bf0c-a6f20dcc38fe +md"### 1" + +# ╔═╡ f080f708-a457-40a3-936c-b82d5159975d +dropletdist = MixtureModel([Poisson(10), Poisson(30)], [0.75, 0.25]); + +# ╔═╡ 7b4aef69-10ec-4935-b7fd-4c1d49aa9b3d +@model function petrigrowth() + P0 ~ dropletdist + r ~ LogNormal(0.0, 0.3) + K ~ Normal(1e5, 1e4) + + logfun = t -> logistic(t, P0, r, K) + Pt = missing + P_obs ~ missing + + return logfun +end + +# ╔═╡ 3e98c640-4bb0-4b5d-bae0-769133a599a7 +prob_splittable = missing + +# ╔═╡ 0ea95b67-d4da-4c5a-ad2e-05024ad074a3 +md"### 2" + +# ╔═╡ 47b15322-0b8c-48f1-b123-271ebae92655 +missing # plot + +# ╔═╡ 9ab88be4-4cf8-4747-ac36-3f1b82899be0 +md"### 3🌟" + +# ╔═╡ 113d8311-7bdc-461c-b077-920e23b33d39 +dropletdist🌟 = missing + +# ╔═╡ 4e730df9-f619-464a-b8a3-57448132404b +begin + plot(dropletdist, label = ["Original prior" ""], color = :orange) + plot!(dropletdist🌟, label = ["Continuous alternative" ""], color = :blue) + # With mixture models it takes some fiddling to make the labels look nice - don't worry about this, it's not important for the course +end + +# ╔═╡ d9b50958-20ae-4085-80d6-19420c7d89df +@model function petrigrowth🌟() + P0 ~ dropletdist🌟 + r ~ LogNormal(0.0, 0.3) + K ~ Normal(1e5, 1e4) + + logfun = t -> logistic(t, P0, r, K) + Pt = missing + P_obs ~ missing + + return logfun +end + +# ╔═╡ 0704ed4d-5b3e-401d-a496-98782cb20b09 +prob_splittable🌟 = missing + +# ╔═╡ Cell order: +# ╟─f84d9259-69c0-4165-8bc0-d924fef18182 +# ╠═75581580-2fb2-4112-b397-2b775eb64630 +# ╠═e07a1ae5-43b7-4c12-831d-43e1738eeac0 +# ╠═d12cdd62-f90a-4ba8-8610-5e86e922e881 +# ╟─aa2b6263-7d13-4683-bc92-25663ed02604 +# ╟─f3510387-1ab6-4aa2-bed9-9c8297a8b3c5 +# ╠═90f185f3-91d4-4ee7-8de3-b76251c0c169 +# ╠═599ae818-fb28-4803-b720-9fcf2bb162b7 +# ╟─b2da9dc3-006c-49d0-81a7-6375a2dc6872 +# ╟─46d3214c-e724-4b98-bc1f-5b9913d2b14a +# ╟─0bdac14c-db71-47b5-95d5-d83fb1e68cab +# 
╟─9e3db8b6-5466-48b6-9ca9-0d24515b9de3 +# ╟─cc0d6b32-6f12-46b8-915f-8a471317c35e +# ╟─96fc1412-83fc-4f67-8995-065b163d739c +# ╠═70fd793a-3ce3-446f-adc1-65ce0a68e48a +# ╠═fb202b86-c058-4f03-9062-ab282c71d5c4 +# ╟─34bf815a-3bd3-49b5-b06d-d4297ca213a8 +# ╟─78571fb5-c44f-4e7f-afde-4436db6c945b +# ╟─2a84472c-cb6f-4607-9b98-c88cc2744e3d +# ╟─34120240-6fac-42b8-b5c8-bc3271927248 +# ╠═4b94e5a6-1ed6-4095-ab18-06b76d4fec99 +# ╟─234b2c87-fe91-4be4-bf0c-a6f20dcc38fe +# ╠═f080f708-a457-40a3-936c-b82d5159975d +# ╠═7b4aef69-10ec-4935-b7fd-4c1d49aa9b3d +# ╠═3e98c640-4bb0-4b5d-bae0-769133a599a7 +# ╟─0ea95b67-d4da-4c5a-ad2e-05024ad074a3 +# ╠═47b15322-0b8c-48f1-b123-271ebae92655 +# ╟─9ab88be4-4cf8-4747-ac36-3f1b82899be0 +# ╠═113d8311-7bdc-461c-b077-920e23b33d39 +# ╠═4e730df9-f619-464a-b8a3-57448132404b +# ╠═d9b50958-20ae-4085-80d6-19420c7d89df +# ╠═0704ed4d-5b3e-401d-a496-98782cb20b09 diff --git a/src/exercises/MCMC_4-review.jl b/src/exercises/MCMC_4-review.jl new file mode 100644 index 0000000..bd8bcf3 --- /dev/null +++ b/src/exercises/MCMC_4-review.jl @@ -0,0 +1,102 @@ +### A Pluto.jl notebook ### +# v0.20.4 + +#> [frontmatter] +#> order = "23" +#> title = "4. 
MCMC review" +#> date = "2025-03-24" +#> tags = ["exercises"] +#> description = "Review inference exercise" +#> layout = "layout.jlhtml" +#> +#> [[frontmatter.author]] +#> name = "Bram Spanoghe" + +using Markdown +using InteractiveUtils + +# ╔═╡ ef127ffc-1e2e-4c30-945b-d6cded4d6515 +using Pkg; Pkg.activate("../../pluto-deployment-environment") + +# ╔═╡ deb237f9-f0ff-4dfc-8627-43053ccdeb20 +using Turing, StatsPlots + +# ╔═╡ 2d709e7b-3350-47ec-a3ed-8aa47ad5a8c2 +function generate_data(n_wasps = 10; minbound = 0, maxbound = 1000) + + x_n, y_n = rand(DiscreteUniform(minbound, maxbound), 2) + xs, ys = [rand(DiscreteUniform(minbound, maxbound), n_wasps) for _ in 1:2] + v_wasps = rand(Uniform(5, 10), n_wasps) + ts = [2*sqrt((x-x_n)^2 + (y-y_n)^2)/v_wasp for (x, y, v_wasp) in zip(xs, ys, v_wasps)] + + return xs, ys, ts, [x_n, y_n] +end; + +# ╔═╡ af96af94-d969-4e9a-93f4-e205f8b7f576 +md"# Review exercise: Hornet nests" + +# ╔═╡ 07f05baa-1336-4e1a-9cbc-5f4506e5b34a +md""" +In recent years, the Asian giant hornet (_Vespa mandarinia_) has become an invasive species in a number of countries, including Belgium. Since they become aggressive when people get close to their nests, the nests often need to be removed when they appear in residential areas. Finding the nests, however, can be a difficult task: the hornets can go hunting over a kilometer from their nest. +""" + +# ╔═╡ 90aee617-84d5-4915-afb4-e7d422cfb4b7 +md""" +One method for finding the nest is to set up a feeder station, mark any hornets gathering food, and record how long it takes for them to fly back to their nest with it and return for more. Given an estimate of their flight speed, the return time can be used to infer the distance from that location to the nest. Repeated measurements at other locations give enough information for a triangulation of sorts. 
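The forward model behind this is simple: a hornet flying at speed `v` between a feeder at `(x, y)` and a nest at `(x_n, y_n)` takes a round-trip time of `t = 2d/v`, with `d` the Euclidean distance. A minimal sketch (plain Julia; the numbers are made up for illustration):

```julia
# Round-trip flight time between a feeder station and the nest.
return_time(x, y, x_n, y_n, v) = 2 * sqrt((x - x_n)^2 + (y - y_n)^2) / v

# A feeder 300 units east and 400 units north of a nest at the origin,
# with the hornet flying at speed 8:
return_time(300, 400, 0, 0, 8.0)  # 2 * 500 / 8 = 125.0
```

Inverting this relation (noisy times in, nest coordinates out) is exactly the inference task below.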
+""" + +# ╔═╡ 082a4cc0-b151-4f8b-87da-278484218e6e +md""" +![The Asian giant hornet](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Vespa_mandarinia_japonica1.jpg/1280px-Vespa_mandarinia_japonica1.jpg) +*The Asian giant hornet (credit: Picture by KENPEI on Wikipedia)* +""" + +# ╔═╡ bfb49d79-772d-4066-bd96-14143a1b5eeb +md""" +Below are the coordinates of the feeder stations, along with the return times of the hornets that were marked there. +""" + +# ╔═╡ c1850e64-9e2c-46fb-b7cd-22e8af81d3aa +xs, ys, ts, true_location = generate_data(); + +# ╔═╡ 28cb3363-6856-4164-b60e-36ec2e88ed56 +scatter( + xs, ys, label = "wasp locations", marker_z = ts, + title = "Locations of wasps colored by return time", + xlims = (0, 1000), ylims = (0, 1000) +) + +# ╔═╡ 484b56ec-57b2-496d-a5cf-1b1c0da97c58 +md""" +!!! question + Where is the hornet nest located? You may assume the nest is somewhere within the plot's boundaries. +""" + +# ╔═╡ 66862953-aeb2-4310-90bc-673260f0ecc0 +x_nest_sp = missing # vector with possible values of the nest's x-coordinate + +# ╔═╡ f522cbcb-112b-4d56-b860-5aee44b1aa2f +y_nest_sp = missing # vector with possible values of the nest's y-coordinate + +# ╔═╡ 511f27c3-7cdb-426f-b794-85897ff44135 +begin + scatter(x_nest_sp, y_nest_sp, opacity = 0.1, color = :blue, label = "Estimated nest locations", xlims = (0, 1000), ylims = (0, 1000), markershape = :square); + scatter!(xs, ys, color = :orange, label = "wasp locations", marker_z = ts) + scatter!(true_location[1:1], true_location[2:2], color = RGB(0, 1, 0), label = "True nest location", markershape = :square) +end + +# ╔═╡ Cell order: +# ╠═ef127ffc-1e2e-4c30-945b-d6cded4d6515 +# ╠═deb237f9-f0ff-4dfc-8627-43053ccdeb20 +# ╟─2d709e7b-3350-47ec-a3ed-8aa47ad5a8c2 +# ╟─af96af94-d969-4e9a-93f4-e205f8b7f576 +# ╟─07f05baa-1336-4e1a-9cbc-5f4506e5b34a +# ╟─90aee617-84d5-4915-afb4-e7d422cfb4b7 +# ╟─082a4cc0-b151-4f8b-87da-278484218e6e +# ╟─bfb49d79-772d-4066-bd96-14143a1b5eeb +# ╠═c1850e64-9e2c-46fb-b7cd-22e8af81d3aa +# 
╟─28cb3363-6856-4164-b60e-36ec2e88ed56 +# ╟─484b56ec-57b2-496d-a5cf-1b1c0da97c58 +# ╠═66862953-aeb2-4310-90bc-673260f0ecc0 +# ╠═f522cbcb-112b-4d56-b860-5aee44b1aa2f +# ╠═511f27c3-7cdb-426f-b794-85897ff44135