Background
yoglen was developed using R version 4.5.1. yoglen contains scripts for systems analysis of menopause from large-scale studies where menopause timing is unknown. yoglen uses latent variable modelling to infer how lab tests and outcomes change as a function of time to menopause. It includes two parallel analyses using two related but distinct models and two different datasets (NHANES and Clalit).
For additional information see the full paper (below).
Please cite as:
Pridham*, Hayut*, et al. Dynamics of menopause from deconvolution of millions of lab tests. arXiv [q-bio.TO] (2025) doi:10.48550/arXiv.2511.05906.
Contains:
-essential analysis steps (i.e. yoglen model fitting)
-plot scripts for each figure
-aggregated data needed for each figure and summary stats
-simulated NHANES data
-unedited analysis scripts for transparency (transparency folder)
Does not include individualized data:
-NHANES data (download here: https://wwwn.cdc.gov/nchs/nhanes/)
-Clalit electronic medical records
- RStudio is needed to run .Rmd files (https://posit.co/downloads/)
- R packages (dependencies.R): c("survival","mgcv","survPen","dplyr","ggplot2","cowplot","scico","gridExtra","ggrepel")
- Clone repository
- Install dependencies (dependencies.R)
- Install and open RStudio
- Set outputDir to your desired directory (top of each .Rmd script)
- Ready to go!
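If you would like to check which of the listed packages are already installed before running dependencies.R, a minimal sketch is shown below. It is not the contents of dependencies.R; the helper name `missing_pkgs` is hypothetical, and the package list is copied from the requirements above.

```r
# Packages listed in the requirements above
pkgs <- c("survival", "mgcv", "survPen", "dplyr", "ggplot2",
          "cowplot", "scico", "gridExtra", "ggrepel")

# Hypothetical helper: return the packages that cannot be loaded
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(pkgs)

# Only install in an interactive session, so that sourcing this
# file from a script never triggers downloads unexpectedly
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```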
git clone git@github.com:AlonLabWIS/yoglen.git
cd yoglen
Rscript dependencies.R
Run plot.Rmd
-This will use existing aggregated data from the original analysis
-This will default to using simulated data, unless the user provides real data.
The general steps are:
- Estimate the final menstrual period (FMP) distribution and impute a distribution of values for each person (age_of_menopause_imputation.Rmd),
- Model selection (model_selection.Rmd),
- Plot results (plot.Rmd).
-yoglen_simulation_study.Rmd will use the fitted models to simulate realistic data.
-yoglen_validation_step_function.Rmd simulates a simple step function to show that the approaches are insensitive to shifting and scaling of the FMP distribution.
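To illustrate why a sharp change at menopause appears smooth in cross-sectional data, here is a minimal R sketch. It is not taken from yoglen_validation_step_function.Rmd. It assumes a normal FMP distribution (mean 50, SD 4 years), a step of size 2, and Gaussian noise; all variable names and parameter values are hypothetical.

```r
set.seed(42)

n   <- 20000
age <- runif(n, 35, 65)               # observed age
fmp <- rnorm(n, mean = 50, sd = 4)    # latent age of menopause (assumed distribution)

# Step function in time to menopause: baseline 1, jump of 2 after the FMP
y <- 1 + 2 * (age > fmp) + rnorm(n, sd = 0.5)

# Mean of y in 1-year age bins (what we can see without knowing the FMP)
bin      <- floor(age)
observed <- tapply(y, bin, mean)

# The step convolved with the FMP distribution, evaluated at bin midpoints:
# E[y | age] = 1 + 2 * P(FMP < age)
mid      <- as.numeric(names(observed)) + 0.5
expected <- 1 + 2 * pnorm(mid, mean = 50, sd = 4)

# Largest discrepancy between the binned means and the convolved step
max_gap <- max(abs(observed - expected))
```

Under these assumptions `max_gap` should be small relative to the step size: the age-binned mean follows the step blurred by the FMP distribution, which is the blurring that the latent variable model aims to undo.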
The yoglen algorithm uses expectation maximization to maximize the expected log-likelihood using a discrete grid for the latent variable. The latent variable is the unknown age of menopause. Regression models as functions of time to menopause can be optionally supplied assuming conditional independence (i.e. ignoring direct correlations between variables but allowing them via time to menopause).
(Please forgive the use of
The full
where
In the expectation maximization approach we iteratively estimate the expected latent variable distribution,
this is exactly the weighted log-likelihood for regression with weights
1. Estimate the starting distribution, $\pi_{ij}$, for each individual based on the menopause vs age curve.
2. While the expected log-likelihood $E[l|a]$ is still increasing:
2a. Estimate a weighted regression model using the vectorized $\pi_{ij}$, i.e. $f(t_{ij})$ with weight $\pi_{ij}$.
2b. Update $\pi_{ij}=p(t_i|a_i,y_i)$.
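The loop above can be sketched in a few lines of R. This is an illustrative toy, not the yoglen implementation: it assumes simulated data, a single Gaussian outcome, a step-function regression in time to menopause, and the same normal FMP prior (mean 50, SD 4) for every individual instead of a prior derived from a menopause vs age curve. For the stopping rule it monitors the observed-data log-likelihood, which EM never decreases, as a stand-in for the expected log-likelihood. All names are hypothetical.

```r
set.seed(1)

# Simulated data: step of size 2 at the (unobserved) age of menopause
n   <- 300
age <- runif(n, 35, 65)
fmp <- rnorm(n, mean = 50, sd = 4)
y   <- 1 + 2 * (age > fmp) + rnorm(n, sd = 0.5)

# Discrete grid for the latent age of menopause, with an assumed prior
grid  <- seq(38, 62, by = 0.5)
prior <- dnorm(grid, mean = 50, sd = 4)
prior <- prior / sum(prior)

# Vectorized (long) format: one row per individual i and grid point j
long <- data.frame(i = rep(seq_len(n), each = length(grid)),
                   j = rep(seq_along(grid), times = n))
long$y    <- y[long$i]
long$t    <- age[long$i] - grid[long$j]   # time to menopause t_ij
long$post <- as.numeric(long$t > 0)       # step-function regressor
long$w    <- prior[long$j]                # step 1: starting pi_ij

loglik <- -Inf
for (iter in 1:200) {
  # 2a. weighted regression f(t_ij) with weights pi_ij
  fit   <- lm(y ~ post, data = long, weights = w)
  mu    <- predict(fit, newdata = long)
  sigma <- sqrt(sum(long$w * (long$y - mu)^2) / sum(long$w))

  # 2b. update pi_ij, proportional to prior_j * p(y_i | t_ij)
  joint  <- prior[long$j] * dnorm(long$y, mean = mu, sd = sigma)
  marg   <- as.numeric(tapply(joint, long$i, sum))  # p(y_i), summed over j
  long$w <- joint / marg[long$i]

  # Stop once the observed-data log-likelihood no longer improves
  new_loglik <- sum(log(marg))
  if (new_loglik - loglik < 1e-6) break
  loglik <- new_loglik
}

coef(fit)   # intercept and step size estimated without observing the FMP
```

In this toy setting the coefficient on `post` should land near the simulated step size, and the weights `long$w` sum to one within each individual. Swapping `lm` for a weighted spline or generalized linear model would change only step 2a and the likelihood term in step 2b.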
For the linear model we solve the likelihood exactly. For generalized linear models we used a spline model with default settings and weighted with the (
There are two methods used in the original paper:
Method 1. (used on NHANES)
We have observed
Method 2. (used on Clalit)
Piecewise linear model for