MESA is a Python package for sample-level multimodal cfDNA biomarker modeling. It provides a scikit-learn-style API for preprocessing, feature selection, optional redundancy pruning, modality-specific model fitting, stacked multimodal prediction, and cross-validation.
The package supports both classification and regression.
pip install mesa-cfdna
For local development:
pip install -e .
pytest -q tests
python scripts/run_smoke_checks.py
- Handles missing-value filtering and imputation
- Applies variance filtering and univariate feature selection
- Optionally prunes redundant correlated features after the first selector
- Uses Boruta for secondary feature selection
- Trains single-modality predictors and stacked multimodal models
- Evaluates models with built-in cross-validation helpers
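The first stages listed above (imputation, variance filtering, univariate selection) can be sketched with plain scikit-learn components. This is an illustrative stand-in under assumed settings (mean imputation, an ANOVA F-test, `k=10`), not MESA's internal pipeline; the synthetic data and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, VarianceThreshold, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data: 60 samples, 30 features, about 5% of values missing at random.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 30))
X[:, 0] += 2.0 * y  # make the first feature informative
X[rng.random(X.shape) < 0.05] = np.nan

# Assumed stand-in for the early preprocessing stages (not MESA's own code).
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("variance", VarianceThreshold(threshold=0.0)),
    ("univariate", GenericUnivariateSelect(f_classif, mode="k_best", param=10)),
])
X_selected = preprocess.fit_transform(X, y)

# Indices of the features that survived the univariate selector.
selected_idx = preprocess.named_steps["univariate"].get_support(indices=True)
print(X_selected.shape)
```

In MESA these steps are configured through `MESA_modality` arguments rather than assembled by hand.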
- MESA_modality: single-modality pipeline
- MESA: multimodal stacking ensemble
- MESA_CV: cross-validation wrapper
Default task-aware estimators:
- classification: RandomForestClassifier
- regression: RandomForestRegressor
predict_proba() and transform_predict_proba() are available only in classification mode.
- Overview figure: compact pipeline summary for README or slides
- Detailed method figure: expanded schematic with task-aware branches
Regenerate both figures with:
source /data/homezvol0/chaoronc/miniconda3/etc/profile.d/conda.sh
conda activate py313
python scripts/generate_pipeline_figures.py
from mesa import MESA_modality, MESA, MESA_CV
# X1_train/X1_test and X2_train/X2_test are user-supplied feature matrices,
# one pair per modality; y_train holds the class labels.
modality_1 = MESA_modality(
    top_n=50,
    missing=0.2,
    normalization=True,
    redundancy_pruning="score",
    redundancy_threshold=0.95,
    random_state=42,
)
modality_2 = MESA_modality(
    top_n=80,
    missing=0.1,
    redundancy_pruning="model",
    redundancy_threshold=0.95,
    random_state=42,
)
# Single-modality fit and class-probability prediction
modality_1.fit(X1_train, y_train)
proba_1 = modality_1.transform_predict_proba(X1_test)
# Stacked multimodal model: one feature matrix per modality, in the same order
mesa = MESA([modality_1, modality_2], random_state=42)
mesa.fit([X1_train, X2_train], y_train)
ensemble_proba = mesa.predict_proba([X1_test, X2_test])
# Cross-validated evaluation of a single modality
cv_eval = MESA_CV(MESA_modality(top_n=50, random_state=42))
cv_eval.fit(X1_train, y_train)
auc = cv_eval.get_performance()
from mesa import MESA_modality, MESA, MESA_CV
# y_train_continuous is a user-supplied continuous target vector.
reg_modality_1 = MESA_modality(
    task="regression",
    top_n=50,
    redundancy_pruning="score",
    redundancy_threshold=0.95,
    random_state=42,
)
reg_modality_2 = MESA_modality(
    task="regression",
    top_n=80,
    random_state=42,
)
# Single-modality fit and prediction
reg_modality_1.fit(X1_train, y_train_continuous)
pred_1 = reg_modality_1.transform_predict(X1_test)
# Stacked multimodal regression
reg_mesa = MESA(
    [reg_modality_1, reg_modality_2],
    task="regression",
    random_state=42,
)
reg_mesa.fit([X1_train, X2_train], y_train_continuous)
ensemble_pred = reg_mesa.predict([X1_test, X2_test])
# Cross-validated evaluation; get_performance() returns the default metric
cv_eval = MESA_CV(
    MESA_modality(task="regression", top_n=50, random_state=42),
    task="regression",
)
cv_eval.fit(X1_train, y_train_continuous)
r2 = cv_eval.get_performance()
rmse = cv_eval.get_performance("neg_root_mean_squared_error")
MESA can prune correlated CpG-like features after the first univariate selector and before Boruta.
- redundancy_pruning="score": keep the best feature in each correlated block using task-aware univariate ranking
- redundancy_pruning="model": keep the best feature in each correlated block using model-based cross-validated ranking
- redundancy_threshold: absolute correlation threshold used to define redundant blocks
- redundancy_method: correlation method, e.g. "pearson"
This step is useful when neighboring or highly correlated features carry redundant signal and would otherwise crowd out other informative loci.
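To make the idea concrete, here is a minimal sketch of score-based pruning. It assumes a greedy rule (visit features from best to worst univariate score and drop any feature whose absolute correlation with an already kept feature reaches the threshold); MESA's actual block definition and ranking may differ. The helper name `prune_correlated` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def prune_correlated(X, y, threshold=0.95, method="pearson"):
    """Greedy sketch: keep the best-scoring feature of each correlated group."""
    scores, _ = f_classif(X, y)
    corr = X.corr(method=method).abs().to_numpy()
    kept = []
    for j in np.argsort(-scores):  # best-scoring features first
        # Keep feature j only if it is not redundant with a kept feature.
        if all(corr[j, k] < threshold for k in kept):
            kept.append(int(j))
    return X.columns[sorted(kept)]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),
})
y = (a > 0).astype(int)

kept_columns = prune_correlated(X, y, threshold=0.95)
print(list(kept_columns))
```

Only one of `a` and `a_copy` survives, while the uncorrelated `b` is retained regardless of its score.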
MESA_modality parameters:
- task: learning task, either "classification" or "regression".
- random_state: random seed used by the default task-aware estimators.
- boruta_estimator: estimator used inside Boruta. If omitted, MESA uses a task-aware random forest.
- top_n: number of Boruta-selected features retained in the final modality model.
- variance_threshold: threshold passed to VarianceThreshold after missing-value handling.
- normalization: whether to insert Normalizer() before variance filtering.
- missing: maximum missing fraction tolerated per feature before that feature is removed.
- redundancy_pruning: optional correlated-feature pruning strategy, either None, "score", or "model".
- redundancy_threshold: absolute correlation threshold used to define redundant feature blocks.
- redundancy_method: correlation method used during redundancy pruning, for example "pearson".
- redundancy_estimator: estimator used to rank features within correlated blocks in "model" mode.
- redundancy_cv: cross-validation strategy used by model-based redundancy pruning.
- redundancy_metric: optional task-aware metric used in model-based redundancy pruning.
- predictor: final modality-level estimator. If omitted, MESA uses a task-aware random forest.
- classifier: backward-compatible alias for predictor.
- selector: first-stage univariate selector. An integer is interpreted as k in a GenericUnivariateSelect; None uses the task default.
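As an illustration of the `missing` parameter described above, the following sketch drops columns whose fraction of missing values exceeds a threshold. The helper `drop_sparse_features` and the column names are hypothetical, and treating the threshold as inclusive is an assumption; MESA's own boundary handling may differ.

```python
import numpy as np
import pandas as pd


def drop_sparse_features(X, missing=0.2):
    """Keep columns whose fraction of missing values is at most `missing`."""
    missing_fraction = X.isna().mean(axis=0)
    return X.loc[:, missing_fraction <= missing]


X = pd.DataFrame({
    "cpg_1": [0.1, 0.2, 0.3, 0.4, 0.5],        # no missing values
    "cpg_2": [0.1, np.nan, np.nan, 0.4, 0.5],  # 40% missing
    "cpg_3": [np.nan, 0.2, 0.3, 0.4, 0.5],     # 20% missing
})

filtered = drop_sparse_features(X, missing=0.3)
print(list(filtered.columns))  # ['cpg_1', 'cpg_3']
```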
MESA parameters:
- modalities: list of MESA_modality objects, one per modality.
- task: shared learning task for every modality and the meta-estimator.
- meta_estimator: second-level estimator fitted on modality outputs. If omitted, MESA uses logistic regression for classification and linear regression for regression.
- random_state: random seed used by the default stacking cross-validation splitter.
- cv: cross-validation splitter used to generate out-of-fold modality predictions for stacking.
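The stacking idea behind these parameters can be sketched with scikit-learn alone: each modality contributes out-of-fold predicted probabilities, and a logistic regression is fitted on those columns. This is a simplified illustration under assumed settings (random forests as base models, 5-fold stratified splitting, synthetic data); it omits MESA's per-modality feature selection and is not its implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Two synthetic "modalities" measured on the same 80 samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=80)
X1 = rng.normal(size=(80, 5)) + y[:, None]
X2 = rng.normal(size=(80, 8)) + 0.5 * y[:, None]

# Shared splitter so every modality is predicted on the same held-out folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_columns = []
for X in (X1, X2):
    base = RandomForestClassifier(n_estimators=50, random_state=42)
    # Out-of-fold probability of the positive class for this modality.
    proba = cross_val_predict(base, X, y, cv=cv, method="predict_proba")[:, 1]
    oof_columns.append(proba)

# Meta-estimator fitted on one column of out-of-fold predictions per modality.
meta_X = np.column_stack(oof_columns)
meta = LogisticRegression().fit(meta_X, y)
print(meta_X.shape, meta.coef_.shape)
```

Using out-of-fold predictions keeps the meta-estimator from training on base-model outputs that have already seen the corresponding labels.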
MESA_CV parameters:
- modality: a MESA_modality or MESA object evaluated across folds.
- task: learning task used for default splitting and scoring.
- random_state: random seed used by the default cross-validation splitter.
- cv: explicit cross-validation splitter. If omitted, MESA uses a task-aware default.
- performance_metric: default metric returned by get_performance() when no explicit metric is supplied.
Classification defaults to ROC AUC. Regression defaults to R². Supported regression metrics are r2, neg_mean_squared_error, neg_root_mean_squared_error, pearson, and spearman.
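For reference, the regression metric names above can be reproduced by hand with scikit-learn and SciPy. This sketch assumes the `neg_` prefix follows scikit-learn's convention of negating an error so that larger is better; how MESA computes each metric internally is not shown here, and the toy values are made up.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

mse = mean_squared_error(y_true, y_pred)
metrics = {
    "r2": r2_score(y_true, y_pred),
    "neg_mean_squared_error": -mse,            # negated so higher is better
    "neg_root_mean_squared_error": -np.sqrt(mse),
    "pearson": pearsonr(y_true, y_pred)[0],    # linear correlation
    "spearman": spearmanr(y_true, y_pred)[0],  # rank correlation
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```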
- demo.ipynb: original example notebook
- pruning_validation_demo.ipynb: pruning-focused synthetic validation
- regression_validation_demo.ipynb: regression validation notebook
- scripts/run_smoke_checks.py: notebook-free smoke test runner
- Use pandas DataFrame inputs when possible so selected feature indices can be mapped back to columns cleanly.
- For biological interpretation, validate any pruning or selector change on a subset before large runs; these changes can alter feature rankings and downstream performance.
- Human contributor guidance lives in CONTRIBUTING.md.
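The DataFrame recommendation above is easiest to see with a generic scikit-learn selector: integer indices returned by the selector are looked up in the DataFrame's columns. The snippet uses `SelectKBest` as a stand-in and hypothetical locus names; it does not show MESA's own attribute for selected features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
X = pd.DataFrame(
    rng.normal(size=(40, 6)),
    columns=[f"locus_{i}" for i in range(6)],
)
X["locus_3"] += 3.0 * y  # make one column clearly informative

selector = SelectKBest(f_classif, k=2).fit(X, y)

# Map positional indices back to the original column names.
selected_idx = selector.get_support(indices=True)
selected_names = X.columns[selected_idx].tolist()
print(selected_names)
```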
If you use MESA in research, cite:
Li, Y., Xu, J., Chen, C. et al. Multimodal epigenetic sequencing analysis (MESA) of cell-free DNA for non-invasive colorectal cancer detection. Genome Medicine 16, 9 (2024). https://doi.org/10.1186/s13073-023-01280-6
This repository is distributed under the terms in LICENSE.