A machine learning pipeline for patient-level malaria therapeutic failure prediction, with support for:
- manual clinical feature sets
- automatic feature selection with scikit-learn
- optional VCF-derived features per sample
- model comparison across multiple classifiers and imbalance strategies
- Optuna hyperparameter tuning
- probability calibration and threshold selection
- repeated evaluation across random seeds
- export of a reusable model bundle for inference on new patients
This repository was created within the FCT-funded project in Artificial Intelligence, Data Science and Cybersecurity of relevance to Public Administration:
- Reference: 2024.07292.IACDC
- Principal Investigator: Maria Isabel Mendes Veiga
- Project title (PT): Predição de Falha Terapêutica na Malária usando Aprendizagem Automática
- Project title (EN): Malaria Therapeutic Failure Prediction using Machine Learning
- Acronym: MAL-PREDICT
The broader project aims to develop a machine learning model to predict malaria treatment failure at the patient level, enabling more personalised and effective care. The current work focuses on artemether-lumefantrine (AL), the first-line therapy in Angola, where clinical trial data from PACTR202405595426468 indicated suspected partial resistance. Using patient-level data, including demographic, clinical, and parasitological indicators, this repository supports the development and comparison of predictive models for frontline decision support.
This public repository was prepared to support a reproducible comparison of three main analysis strategies:
- Manual clinical feature sets based on domain knowledge.
- Automatic feature selection with scikit-learn.
- Optional inclusion of VCF-derived genomic features, either combined with clinical variables or used on their own.
In other words, the repository is designed to help compare:
- manual variable selection vs automatic feature selection
- clinical-only vs clinical + VCF vs VCF-only workflows
- different classifiers, imbalance strategies, and calibration choices
MAL-PREDICT/
├── artifacts/ # exported trained models (kept empty in git)
├── configs/
│ └── examples/ # example YAML configurations
├── data/
│ ├── example/ # small example tabular dataset
│ └── raw/ # place private training data here (ignored by git)
├── outputs/ # training outputs (kept empty in git)
├── config.yaml # main configuration template
├── predict_new_patients.py # inference script
├── repeat_evaluation.py # repeated training across random seeds
├── train_final_pipeline.py # main training and model comparison pipeline
└── requirements.txt
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
xgboost is optional. If it is not installed, XGBoost candidates are skipped automatically.
Place the main tabular dataset in one of these locations:
- data/raw/patients.csv
- data/raw/patients.xlsx
The example file included in the repository is:
data/example/patients_example.csv
If you want to include genomic features, place VCF files in:
data/raw/vcfs/
Each VCF filename should match the sample identifier stored in the clinical table. Example:
S001.vcf
S002.vcf.gz
The column used to match clinical rows to VCF files is controlled by:
vcf.sample_id_col
By default, the repository uses:
- success = ACPR
- failure = ETF, LCF, LPF
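The default mapping can be pictured with a short Python sketch. The function name `encode_outcome`, the 0/1 coding, and the `WITHDRAWN` label are assumptions made for illustration; they are not taken from the pipeline's code.

```python
# Illustrative sketch only: encode_outcome and the 0/1 coding are assumptions.
SUCCESS_LABELS = {"ACPR"}
FAILURE_LABELS = {"ETF", "LCF", "LPF"}


def encode_outcome(label):
    """Return 0 for success, 1 for failure, and None for any other outcome."""
    cleaned = label.strip().upper()
    if cleaned in SUCCESS_LABELS:
        return 0
    if cleaned in FAILURE_LABELS:
        return 1
    return None  # neither success nor failure under the default mapping


# "WITHDRAWN" is a hypothetical example of an outcome outside both groups.
outcomes = ["ACPR", "LCF", "etf", "WITHDRAWN"]
encoded = [encode_outcome(o) for o in outcomes]
print(encoded)
```

Rows that map to `None` are the "other outcomes" that fall outside the binary target.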
Other outcomes can be excluded with:
target:
  drop_other_outcomes: true
Manual feature set, table features only:
feature_sources:
  use_table_features: true
  use_vcf_features: false
feature_selection:
  mode: manual
Automatic feature selection with scikit-learn, table features only:
feature_sources:
  use_table_features: true
  use_vcf_features: false
feature_selection:
  mode: sklearn
  sklearn:
    method: f_classif
    k: 20
Table features combined with VCF features:
feature_sources:
  use_table_features: true
  use_vcf_features: true
VCF features only:
feature_sources:
  use_table_features: false
  use_vcf_features: true
When feature_selection.mode is set to sklearn, the following methods are available:
- f_classif: ANOVA F-test ranking
- mutual_info: mutual information ranking
- variance_threshold: unsupervised low-variance filtering
- model_l1: model-based selection with L1-regularised logistic regression
- model_tree: model-based selection with Extra Trees feature importances
Example:
feature_selection:
  mode: sklearn
  exclude_columns: []
  sklearn:
    name: sklearn_auto
    method: model_tree
    k: 20
    variance_threshold: 0.0
    threshold: median
    max_features: 25
    candidate_columns: null
Notes:
- k is used by f_classif and mutual_info
- the variance_threshold parameter is used by the variance_threshold method
- threshold and max_features are mainly relevant for model_l1 and model_tree
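One plausible way to map these method names onto scikit-learn selectors is sketched below. `build_selector` is a hypothetical helper written for this illustration, and the estimator settings (solver, number of trees, random seed) are assumptions; the repository's own implementation may differ.

```python
# Illustrative sketch: build_selector is a hypothetical helper, not the repository's code.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression


def build_selector(method, k=20, variance_threshold=0.0,
                   threshold="median", max_features=25):
    """Return an unfitted scikit-learn selector for the given method name."""
    if method == "f_classif":
        return SelectKBest(score_func=f_classif, k=k)
    if method == "mutual_info":
        return SelectKBest(score_func=mutual_info_classif, k=k)
    if method == "variance_threshold":
        return VarianceThreshold(threshold=variance_threshold)
    if method == "model_l1":
        # Assumed estimator settings; liblinear supports the L1 penalty.
        estimator = LogisticRegression(penalty="l1", solver="liblinear")
        return SelectFromModel(estimator, threshold=threshold,
                               max_features=max_features)
    if method == "model_tree":
        estimator = ExtraTreesClassifier(n_estimators=100, random_state=0)
        return SelectFromModel(estimator, threshold=threshold,
                               max_features=max_features)
    raise ValueError(f"Unknown feature selection method: {method}")


# Synthetic data stands in for a real patient table.
X, y = make_classification(n_samples=80, n_features=40,
                           n_informative=5, random_state=0)
X_anova = build_selector("f_classif", k=10).fit_transform(X, y)
X_tree = build_selector("model_tree").fit_transform(X, y)
print(X_anova.shape, X_tree.shape)
```

The sketch shows why `k` only matters for the ranking methods, while `threshold` and `max_features` only matter for the model-based ones: each parameter is passed to a different selector class.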
For Plasmodium VCF files, the most practical starting point in this repository is:
- use vcf.mode: targeted
- use encoding: presence or encoding: allele_fraction
- keep only FILTER=PASS variants
- require functional annotation when available
- focus on biologically relevant genes and coding effects
This is usually preferable to a genome-wide approach when the sample size is limited, because it reduces noise, improves interpretability, and helps control dimensionality.
Each VCF is parsed into a tabular representation so it can be combined with clinical variables.
The pipeline links each table row to a VCF file using:
- the configured sample identifier column
- the VCF filename stem
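The linking step can be sketched as follows. The helper names and the `sample_id` column are hypothetical; the sketch assumes that both `.vcf` and `.vcf.gz` suffixes are stripped to obtain the stem, as the filename examples above suggest.

```python
# Illustrative sketch: helper names and the "sample_id" column are assumptions.
from pathlib import Path


def vcf_stem(path):
    """Strip .vcf or .vcf.gz so that S002.vcf.gz maps to S002."""
    name = Path(path).name
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None  # not a VCF file


def match_rows_to_vcfs(rows, vcf_paths, sample_id_col):
    """Map each sample identifier to its VCF path, or None if no file exists."""
    by_stem = {}
    for path in vcf_paths:
        stem = vcf_stem(path)
        if stem is not None:
            by_stem[stem] = path
    return {row[sample_id_col]: by_stem.get(row[sample_id_col]) for row in rows}


rows = [{"sample_id": "S001"}, {"sample_id": "S002"}, {"sample_id": "S003"}]
paths = ["data/raw/vcfs/S001.vcf", "data/raw/vcfs/S002.vcf.gz"]
matches = match_rows_to_vcfs(rows, paths, sample_id_col="sample_id")
print(matches)
```

Samples without a matching file, such as `S003` here, map to `None` rather than raising an error.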
- presence: binary 0/1 feature per variant
- allele_fraction: continuous feature derived from AD/DP, AF, or genotype fallback
- dosage: legacy genotype count representation
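The three encodings can be illustrated for a single variant call. This is a minimal sketch under stated assumptions: `encode_variant` is a hypothetical function, the fallback order (AD/DP, then AF, then genotype) follows the description above, and treating missing genotypes as 0 is an assumption of the sketch.

```python
# Illustrative sketch: encode_variant is hypothetical, not the pipeline's parser.
def encode_variant(gt, ad=None, dp=None, af=None, encoding="presence"):
    """Encode one call.

    gt: genotype string such as "0/1", "1|1" or "1"
    ad: (ref_depth, alt_depth) tuple, if available
    dp: total read depth, if available
    af: allele frequency reported in the VCF, if available
    """
    # Missing alleles (".") are ignored; this is an assumption of the sketch.
    alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
    alt_count = sum(1 for a in alleles if a != "0")

    if encoding == "presence":
        return 1 if alt_count > 0 else 0
    if encoding == "dosage":
        return alt_count
    if encoding == "allele_fraction":
        if ad is not None and dp:
            return ad[1] / dp           # alternate reads over total depth
        if af is not None:
            return float(af)            # fall back to the reported AF
        if alleles:
            return alt_count / len(alleles)  # genotype fallback
        return 0.0
    raise ValueError(f"Unknown encoding: {encoding}")


print(encode_variant("0/1"))
print(encode_variant("1/1", encoding="dosage"))
print(encode_variant("0/1", ad=(30, 10), dp=40, encoding="allele_fraction"))
```

The continuous encoding can be useful for mixed infections, where the alternate allele may be present in only part of the parasite population.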
If SnpEff ANN annotations are present, the pipeline can filter by:
- target genes
- target effects
- impact level
- presence of annotation
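A minimal sketch of such a filter is shown below. It relies on the standard ANN layout, in which each comma-separated entry starts with the pipe-separated sub-fields allele, effect, impact, and gene name. The function names, the parameter names, and the `GENE_A` and `GENE_B` placeholders are hypothetical.

```python
# Illustrative sketch: function names, parameters, and gene names are assumptions.
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def parse_ann(ann_value):
    """Split an ANN INFO value into dicts holding the first four sub-fields."""
    records = []
    for entry in ann_value.split(","):
        fields = entry.split("|")
        if len(fields) < 4:
            continue  # skip malformed entries
        records.append({
            "allele": fields[0],
            "effect": fields[1],
            "impact": fields[2],
            "gene": fields[3],
        })
    return records


def keep_variant(ann_value, target_genes=None, target_effects=None,
                 min_impact=None, require_annotation=True):
    """Return True if at least one annotation passes every active filter."""
    if not ann_value:
        return not require_annotation
    for rec in parse_ann(ann_value):
        effects = set(rec["effect"].split("&"))  # combined effects are joined by "&"
        if target_genes and rec["gene"] not in target_genes:
            continue
        if target_effects and not effects & set(target_effects):
            continue
        if min_impact and IMPACT_RANK.get(rec["impact"], -1) < IMPACT_RANK[min_impact]:
            continue
        return True
    return False


ann = ("T|missense_variant|MODERATE|GENE_A|GENE_A_ID|transcript,"
       "T|synonymous_variant|LOW|GENE_B|GENE_B_ID|transcript")
print(keep_variant(ann, target_genes={"GENE_A"},
                   target_effects={"missense_variant"}, min_impact="MODERATE"))
```

A variant is kept as soon as one of its annotations satisfies all active filters, which is one reasonable convention when a variant overlaps several transcripts.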
Even after biological filtering, the training pipeline further limits VCF feature space by:
- keeping only variants observed in at least vcf.min_train_samples_with_variant training samples
- capping the number of retained variants with vcf.max_variants
- adding the VCF__AVAILABLE feature to encode file availability
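These three steps can be sketched in plain Python. The function names and the variant key format are hypothetical, and ranking by frequency when applying the cap is an assumption of this sketch; the pipeline may use a different criterion.

```python
# Illustrative sketch: function names, variant keys, and the ranking rule are assumptions.
from collections import Counter


def select_variants(train_variants, min_train_samples_with_variant=2, max_variants=3):
    """Pick variants using training samples only.

    train_variants maps sample ID -> set of variant keys seen in that sample.
    """
    counts = Counter(v for variants in train_variants.values() for v in variants)
    eligible = [v for v, n in counts.items() if n >= min_train_samples_with_variant]
    # Most frequent first; ties broken alphabetically for reproducibility.
    eligible.sort(key=lambda v: (-counts[v], v))
    return eligible[:max_variants]


def build_row(sample_id, sample_variants, kept):
    """Build one presence-encoded feature row, including VCF__AVAILABLE."""
    present = sample_variants.get(sample_id, set())
    row = {"VCF__AVAILABLE": int(sample_id in sample_variants)}
    for variant in kept:
        row[variant] = int(variant in present)
    return row


train = {
    "S001": {"chr1:100:A>T", "chr2:50:G>C"},
    "S002": {"chr1:100:A>T"},
    "S003": {"chr1:100:A>T", "chr2:50:G>C", "chr3:7:C>T"},
}
kept = select_variants(train)
print(kept)
print(build_row("S002", train, kept))
print(build_row("S999", train, kept))  # sample without a VCF file
```

Selecting variants from training samples only keeps information from held-out samples out of the feature definition, and the availability flag lets a model distinguish "variant absent" from "no VCF provided".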
python train_final_pipeline.py --config config.yaml
Main outputs:
- outputs/final_candidate_results.csv
- outputs/final_candidate_results_ranked.csv
- outputs/final_model_summary.json
- outputs/final_model_card.md
- artifacts/best_model_bundle.joblib
python repeat_evaluation.py
This script reruns training across multiple random seeds and stores summary tables in:
outputs/repeat_eval/
Without VCF features:
python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --output predictions.tsv
With VCF features:
python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --vcf-dir data/raw/vcfs \
  --output predictions.tsv
For VCF-only models, the input table only needs the sample identifier column.
Example YAML files are provided in:
- configs/examples/manual_table_only.yaml
- configs/examples/sklearn_model_tree_table_only.yaml
- configs/examples/table_plus_targeted_vcf.yaml
- configs/examples/vcf_only_targeted.yaml
This repository has been prepared for public release with the following principles:
- code comments and documentation are in English
- IDE files, cached files, private raw data, generated artifacts, and outputs are ignored by git
- the repository ships with an example dataset structure, not private patient data
- configuration is template-based so users can adapt it to their own data
- Small datasets can make performance estimates unstable.
- VCF-based modelling can become high-dimensional very quickly.
- Targeted genomic filtering is often more realistic than genome-wide modelling for modest sample sizes.
- Clinical deployment would require independent validation, calibration review, and governance approval.
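To make the first point concrete, the sketch below computes a 95% Wilson score interval for an observed sensitivity of 80% at two sample sizes. The counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
# Illustrative sketch with hypothetical counts; not results from this project.
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Same observed sensitivity (80%), different numbers of true failures.
small = wilson_interval(8, 10)
large = wilson_interval(80, 100)
for label, (low, high) in (("10 failures", small), ("100 failures", large)):
    print(f"{label}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

With only ten treatment failures in a test set, the interval around the same point estimate is far wider than with one hundred, which is why repeated evaluation across seeds and independent validation matter for small cohorts.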
This repository was developed in the context of the MAL-PREDICT project and related work on machine learning support for malaria treatment failure prediction in Angola.