PEvoGen-IT/MAL-Predict

MAL-PREDICT

A machine learning pipeline for patient-level malaria therapeutic failure prediction, with support for:

  • manual clinical feature sets
  • automatic feature selection with scikit-learn
  • optional VCF-derived features per sample
  • model comparison across multiple classifiers and imbalance strategies
  • Optuna hyperparameter tuning
  • probability calibration and threshold selection
  • repeated evaluation across random seeds
  • export of a reusable model bundle for inference on new patients

Project context

This repository was created within a project funded by FCT (Fundação para a Ciência e a Tecnologia) in the area of Artificial Intelligence, Data Science and Cybersecurity of relevance to Public Administration:

  • Reference: 2024.07292.IACDC
  • Principal Investigator: Maria Isabel Mendes Veiga
  • Project title (PT): Predição de Falha Terapêutica na Malária usando Aprendizagem Automática
  • Project title (EN): Malaria Therapeutic Failure Prediction using Machine Learning
  • Acronym: MAL-PREDICT

The broader project aims to develop a machine learning model that predicts malaria treatment failure at the patient level, enabling more personalised and effective care. The current work focuses on artemether-lumefantrine (AL), the first-line therapy in Angola, where data from clinical trial PACTR202405595426468 indicated suspected partial resistance. Using patient-level demographic, clinical, and parasitological indicators, this repository supports the development and comparison of predictive models for frontline decision support.

Purpose of this repository

This public repository was prepared to support a reproducible comparison of three main analysis strategies:

  1. Manual clinical feature sets based on domain knowledge.
  2. Automatic feature selection with scikit-learn.
  3. Optional inclusion of VCF-derived genomic features, either combined with clinical variables or used on their own.

In other words, the repository is designed to help compare:

  • manual variable selection vs automatic feature selection
  • clinical-only vs clinical + VCF vs VCF-only workflows
  • different classifiers, imbalance strategies, and calibration choices

Repository structure

MAL-PREDICT/
├── artifacts/                  # exported trained models (kept empty in git)
├── configs/
│   └── examples/              # example YAML configurations
├── data/
│   ├── example/               # small example tabular dataset
│   └── raw/                   # place private training data here (ignored by git)
├── outputs/                   # training outputs (kept empty in git)
├── config.yaml                # main configuration template
├── predict_new_patients.py    # inference script
├── repeat_evaluation.py       # repeated training across random seeds
├── train_final_pipeline.py    # main training and model comparison pipeline
└── requirements.txt

Installation

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

xgboost is optional. If it is not installed, XGBoost candidates are skipped automatically.

Input data

1. Clinical table

Place the main tabular dataset in one of these locations:

  • data/raw/patients.csv
  • data/raw/patients.xlsx

The example file included in the repository is:

  • data/example/patients_example.csv

2. Per-sample VCF files

If you want to include genomic features, place VCF files in:

  • data/raw/vcfs/

The stem of each VCF filename (the name without the .vcf or .vcf.gz extension) should match the sample identifier stored in the clinical table. Example:

  • S001.vcf
  • S002.vcf.gz

The column used to match clinical rows to VCF files is controlled by:

  • vcf.sample_id_col

Binary target definition

By default, the repository uses:

  • success = ACPR
  • failure = ETF, LCF, LPF

Other outcomes can be excluded with:

target:
  drop_other_outcomes: true
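
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names, the `outcome` column name, the choice of failure as the positive class (1), and the error raised when other outcomes are kept are all assumptions.

```python
# Outcome codes taken from the default target definition above.
SUCCESS_CODES = {"ACPR"}
FAILURE_CODES = {"ETF", "LCF", "LPF"}


def encode_outcome(label):
    """Return 0 for success, 1 for failure, None for any other outcome."""
    code = str(label).strip().upper()
    if code in SUCCESS_CODES:
        return 0
    if code in FAILURE_CODES:
        return 1
    return None


def build_target(rows, outcome_key="outcome", drop_other_outcomes=True):
    """Pair each row with its binary target.

    `outcome_key` is a hypothetical column name. Raising an error when
    other outcomes are kept is an assumption made for this sketch.
    """
    kept = []
    for row in rows:
        y = encode_outcome(row[outcome_key])
        if y is None:
            if drop_other_outcomes:
                continue  # silently exclude rows with other outcomes
            raise ValueError(f"unmapped outcome: {row[outcome_key]!r}")
        kept.append((row, y))
    return kept


rows = [
    {"id": "S001", "outcome": "ACPR"},
    {"id": "S002", "outcome": "ETF"},
    {"id": "S003", "outcome": "withdrawn"},
]
targets = [y for _, y in build_target(rows)]
```

With `drop_other_outcomes=True`, the third row is excluded and only the two mapped outcomes remain.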

Main workflow modes

Clinical table only with manual features

feature_sources:
  use_table_features: true
  use_vcf_features: false

feature_selection:
  mode: manual

Clinical table only with automatic feature selection

feature_sources:
  use_table_features: true
  use_vcf_features: false

feature_selection:
  mode: sklearn
  sklearn:
    method: f_classif
    k: 20

Clinical table + VCF features

feature_sources:
  use_table_features: true
  use_vcf_features: true

VCF-only workflow

feature_sources:
  use_table_features: false
  use_vcf_features: true

Feature selection methods available

When feature_selection.mode: sklearn is enabled, the following methods are available:

  • f_classif — ANOVA F-test ranking
  • mutual_info — mutual information ranking
  • variance_threshold — unsupervised low-variance filtering
  • model_l1 — model-based selection with L1-regularised logistic regression
  • model_tree — model-based selection with Extra Trees feature importances

Example:

feature_selection:
  mode: sklearn
  exclude_columns: []
  sklearn:
    name: sklearn_auto
    method: model_tree
    k: 20
    variance_threshold: 0.0
    threshold: median
    max_features: 25
    candidate_columns: null

Notes:

  • k is used by f_classif and mutual_info
  • the variance_threshold parameter is used by the variance_threshold method
  • threshold and max_features are mainly relevant for model_l1 and model_tree
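
As a rough guide to how these options could map onto scikit-learn objects, the sketch below builds a selector for four of the five methods. It is an assumption about the mapping, not the repository's code: `build_selector` is a hypothetical helper, the estimator settings are illustrative, and model_l1 is left out for brevity (it would follow the same `SelectFromModel` pattern with an L1-regularised logistic regression).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
    mutual_info_classif,
)


def build_selector(method, k=20, variance_threshold=0.0,
                   threshold="median", max_features=25, random_state=0):
    """Return an unfitted scikit-learn selector for a method name."""
    if method == "f_classif":
        # ANOVA F-test ranking, keep the k best features.
        return SelectKBest(score_func=f_classif, k=k)
    if method == "mutual_info":
        # Mutual information ranking, keep the k best features.
        return SelectKBest(score_func=mutual_info_classif, k=k)
    if method == "variance_threshold":
        # Unsupervised: drop features whose variance is not above the cutoff.
        return VarianceThreshold(threshold=variance_threshold)
    if method == "model_tree":
        # Keep features whose Extra Trees importance reaches `threshold`,
        # capped at `max_features`.
        estimator = ExtraTreesClassifier(n_estimators=100,
                                         random_state=random_state)
        return SelectFromModel(estimator, threshold=threshold,
                               max_features=max_features)
    raise ValueError(f"unsupported method in this sketch: {method}")


# Synthetic data purely for demonstration.
X, y = make_classification(n_samples=120, n_features=30,
                           n_informative=5, random_state=0)
X_ftest = build_selector("f_classif", k=5).fit_transform(X, y)
X_tree = build_selector("model_tree", max_features=10).fit_transform(X, y)
```

In a real pipeline the selector should be fitted on training folds only, for example as a step inside a scikit-learn `Pipeline`, so that feature selection does not leak information from held-out data.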

Recommended VCF strategy for Plasmodium

For Plasmodium VCF files, the most practical starting point in this repository is:

  • use vcf.mode: targeted
  • use encoding: presence or encoding: allele_fraction
  • keep only FILTER=PASS variants
  • require functional annotation when available
  • focus on biologically relevant genes and coding effects

This is usually preferable to a genome-wide approach when the sample size is limited, because it reduces noise, improves interpretability, and helps control dimensionality.

How VCF files are converted into features

Each VCF is parsed into a tabular representation so it can be combined with clinical variables.

Matching logic

The pipeline links each table row to a VCF file using:

  • the configured sample identifier column
  • the VCF filename stem
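
A minimal sketch of this matching step, assuming only the two extensions shown earlier (.vcf and .vcf.gz). The helper names are hypothetical and the repository's own logic may differ, for example in how it treats duplicates or case differences.

```python
from pathlib import Path


def vcf_stem(path):
    """Return the filename without .vcf or .vcf.gz, or None if not a VCF."""
    name = Path(path).name
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def match_samples_to_vcfs(sample_ids, vcf_paths):
    """Map each sample identifier to its VCF path, or None if missing."""
    by_stem = {}
    for path in vcf_paths:
        stem = vcf_stem(path)
        if stem is not None:
            by_stem[stem] = str(path)
    return {str(sid): by_stem.get(str(sid)) for sid in sample_ids}


matches = match_samples_to_vcfs(
    ["S001", "S002", "S003"],
    ["data/raw/vcfs/S001.vcf", "data/raw/vcfs/S002.vcf.gz", "notes.txt"],
)
```

Here `S003` maps to `None` because no file with that stem exists, which is the situation an availability indicator can encode.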

Supported encodings

  • presence — binary 0/1 feature per variant
  • allele_fraction — continuous feature derived from AD/DP, AF, or genotype fallback
  • dosage — legacy genotype count representation
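
The three encodings can be illustrated on a single per-sample VCF record. This is a sketch under stated assumptions: the fallback order (AD with DP, then a per-sample AF field, then genotype), the clamping to 1.0, and the treatment of a missing genotype as 0 are guesses for illustration, and the function names are hypothetical.

```python
def parse_sample_fields(format_str, sample_str):
    """Zip a VCF FORMAT column with one sample column into a dict."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def called_alleles(gt):
    """Return the called allele indices of a GT string such as '0/1'."""
    return [a for a in gt.replace("|", "/").split("/") if a not in (".", "")]


def encode_variant(fields, encoding="presence"):
    called = called_alleles(fields.get("GT", "."))
    n_alt = sum(1 for a in called if a != "0")
    if encoding == "presence":
        return 1 if n_alt > 0 else 0
    if encoding == "dosage":
        return n_alt  # count of non-reference alleles in the genotype
    if encoding == "allele_fraction":
        # First choice: alternate read depth divided by total depth.
        try:
            depths = [int(x) for x in fields["AD"].split(",")]
            dp = fields.get("DP", ".")
            depth = int(dp) if dp != "." else sum(depths)
            if depth > 0:
                return min(1.0, sum(depths[1:]) / depth)
        except (KeyError, ValueError):
            pass
        # Second choice: a per-sample AF value, if the caller wrote one.
        try:
            return float(fields["AF"].split(",")[0])
        except (KeyError, ValueError):
            pass
        # Last resort: fraction of called alleles that are non-reference.
        return n_alt / len(called) if called else 0.0
    raise ValueError(f"unknown encoding: {encoding}")


fields = parse_sample_fields("GT:AD:DP", "0/1:12,8:20")
presence = encode_variant(fields, "presence")
dosage = encode_variant(fields, "dosage")
fraction = encode_variant(fields, "allele_fraction")
```

For this record, 8 of 20 reads support the alternate allele, so the allele fraction is 8/20.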

Annotation-aware filtering

If SnpEff ANN annotations are present, the pipeline can filter by:

  • target genes
  • target effects
  • impact level
  • presence of annotation
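
The filters listed above, together with the FILTER=PASS rule from the recommended strategy, could be combined roughly as follows. The sketch relies on the standard SnpEff ANN layout, in which entries are comma-separated and the first pipe-separated fields are allele, effect, impact, and gene name. The function names, parameters, and the example INFO string are illustrative assumptions, not the repository's API or real data.

```python
def parse_ann(info):
    """Extract effect, impact, and gene from a SnpEff ANN INFO entry."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            entries = []
            for raw in item[len("ANN="):].split(","):
                parts = raw.split("|")
                if len(parts) >= 4:
                    entries.append({"effect": parts[1],
                                    "impact": parts[2],
                                    "gene": parts[3]})
            return entries
    return []


def keep_variant(filter_col, info, target_genes=None, target_effects=None,
                 allowed_impacts=None, require_annotation=True):
    """Return True if a variant passes the quality and annotation filters."""
    if filter_col != "PASS":
        return False
    annotations = parse_ann(info)
    if not annotations:
        # Assumption: unannotated variants pass only when not required.
        return not require_annotation
    for ann in annotations:
        effects = set(ann["effect"].split("&"))  # SnpEff joins effects with &
        if target_genes and ann["gene"] not in target_genes:
            continue
        if target_effects and not effects & set(target_effects):
            continue
        if allowed_impacts and ann["impact"] not in allowed_impacts:
            continue
        return True  # at least one annotation satisfies every filter
    return False


# Illustrative INFO string, not taken from real data.
info = ("DP=50;ANN=T|missense_variant|MODERATE|pfmdr1|ID1|transcript|T1|"
        "protein_coding|1/1|c.1A>T|p.Asn1Tyr")
kept = keep_variant("PASS", info, target_genes={"pfmdr1"},
                    target_effects={"missense_variant"})
```

A variant is kept when at least one of its annotations satisfies all active filters, which is one reasonable reading of "filter by" when a variant overlaps several transcripts.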

Training-time dimensionality control

Even after biological filtering, the training pipeline further limits VCF feature space by:

  • keeping only variants observed in at least vcf.min_train_samples_with_variant training samples
  • capping the number of retained variants with vcf.max_variants
  • adding the VCF__AVAILABLE feature to encode file availability
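
These three controls can be sketched with plain Python containers. Only the VCF__AVAILABLE feature name and the two vcf.* settings come from the description above. The helper names, the `VCF__` prefix on variant columns, the variant key format, and the rule of keeping the most frequent variants when the cap applies are assumptions.

```python
from collections import Counter


def select_training_variants(train_variants,
                             min_train_samples_with_variant=2,
                             max_variants=100):
    """Pick variants to keep, using training samples only.

    `train_variants` maps sample id -> set of variant keys, or None when
    the sample has no VCF file.
    """
    counts = Counter()
    for variants in train_variants.values():
        if variants:
            counts.update(set(variants))
    frequent = [(key, n) for key, n in counts.items()
                if n >= min_train_samples_with_variant]
    # Assumption: prefer the most frequent variants, ties broken by key.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [key for key, _ in frequent[:max_variants]]


def build_feature_row(variants, kept_variants):
    """Encode one sample as presence features plus an availability flag."""
    row = {"VCF__AVAILABLE": 0 if variants is None else 1}
    present = variants or set()
    for key in kept_variants:
        row[f"VCF__{key}"] = 1 if key in present else 0
    return row


# Made-up variant keys for demonstration.
train = {
    "S001": {"chr1:100:A>T", "chr2:200:G>C"},
    "S002": {"chr1:100:A>T"},
    "S003": None,  # no VCF file for this sample
}
kept = select_training_variants(train, min_train_samples_with_variant=2)
row_s001 = build_feature_row(train["S001"], kept)
row_s003 = build_feature_row(train["S003"], kept)
```

Selecting variants from training samples only, then reusing the same list for validation and new patients, keeps the feature space fixed and avoids leaking information from held-out data.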

Training

python train_final_pipeline.py --config config.yaml

Main outputs:

  • outputs/final_candidate_results.csv
  • outputs/final_candidate_results_ranked.csv
  • outputs/final_model_summary.json
  • outputs/final_model_card.md
  • artifacts/best_model_bundle.joblib

Repeated evaluation across seeds

python repeat_evaluation.py

This script reruns training across multiple random seeds and stores summary tables in:

  • outputs/repeat_eval/
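
The kind of per-metric summary such a table might contain can be sketched as below. The result structure, the metric name `roc_auc`, and the chosen statistics are assumptions for illustration. The actual columns written by repeat_evaluation.py may differ.

```python
from statistics import mean, stdev


def summarise_across_seeds(results):
    """Aggregate per-seed metric dicts into mean, std, min, and max."""
    metrics = sorted({key for r in results for key in r if key != "seed"})
    summary = {}
    for metric in metrics:
        values = [r[metric] for r in results if metric in r]
        summary[metric] = {
            "n": len(values),
            "mean": mean(values),
            # Sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary


# Hypothetical per-seed results, not real model performance.
results = [
    {"seed": 0, "roc_auc": 0.70},
    {"seed": 1, "roc_auc": 0.80},
    {"seed": 2, "roc_auc": 0.90},
]
summary = summarise_across_seeds(results)
```

Reporting the spread alongside the mean matters on small datasets, where a single seed can give a misleadingly optimistic or pessimistic estimate.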

Prediction on new patients

Table-based model

python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --output predictions.tsv

Model that also expects VCF features

python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --vcf-dir data/raw/vcfs \
  --output predictions.tsv

For VCF-only models, the input table only needs the sample identifier column.

Example configuration files

Example YAML files are provided in:

  • configs/examples/manual_table_only.yaml
  • configs/examples/sklearn_model_tree_table_only.yaml
  • configs/examples/table_plus_targeted_vcf.yaml
  • configs/examples/vcf_only_targeted.yaml

Public repository notes

This repository has been prepared for public release with the following principles:

  • code comments and documentation are in English
  • IDE files, cached files, private raw data, generated artifacts, and outputs are ignored by git
  • the repository ships with an example dataset structure, not private patient data
  • configuration is template-based so users can adapt it to their own data

Limitations

  • Small datasets can make performance estimates unstable.
  • VCF-based modelling can become high-dimensional very quickly.
  • Targeted genomic filtering is often more realistic than genome-wide modelling for modest sample sizes.
  • Clinical deployment would require independent validation, calibration review, and governance approval.

Acknowledgements

This repository was developed in the context of the MAL-PREDICT project and related work on machine learning support for malaria treatment failure prediction in Angola.
