MAL-PREDICT

A machine learning pipeline for patient-level malaria therapeutic failure prediction, with support for:

manual clinical feature sets
automatic feature selection with scikit-learn
optional VCF-derived features per sample
model comparison across multiple classifiers and imbalance strategies
Optuna hyperparameter tuning
probability calibration and threshold selection
repeated evaluation across random seeds
export of a reusable model bundle for inference on new patients

Project context

This repository was created as part of a project funded by FCT (Fundação para a Ciência e a Tecnologia) in the area of Artificial Intelligence, Data Science and Cybersecurity of relevance to Public Administration:

Reference: 2024.07292.IACDC
Principal Investigator: Maria Isabel Mendes Veiga
Project title (PT): Predição de Falha Terapêutica na Malária usando Aprendizagem Automática
Project title (EN): Malaria Therapeutic Failure Prediction using Machine Learning
Acronym: MAL-PREDICT

The broader project aims to develop a machine learning model to predict malaria treatment failure at the patient level, enabling more personalised and effective care. The current work focuses on artemether-lumefantrine (AL), the first-line therapy in Angola, where clinical trial data from PACTR202405595426468 indicated suspected partial resistance. Using patient-level data, including demographic, clinical, and parasitological indicators, this repository supports the development and comparison of predictive models for frontline decision support.

Purpose of this repository

This public repository was prepared to support a reproducible comparison of three main analysis strategies:

Manual clinical feature sets based on domain knowledge.
Automatic feature selection with scikit-learn.
Optional inclusion of VCF-derived genomic features, either combined with clinical variables or used on their own.

In other words, the repository is designed to help compare:

manual variable selection vs automatic feature selection
clinical-only vs clinical + VCF vs VCF-only workflows
different classifiers, imbalance strategies, and calibration choices

Repository structure

MAL-PREDICT/
├── artifacts/                  # exported trained models (kept empty in git)
├── configs/
│   └── examples/              # example YAML configurations
├── data/
│   ├── example/               # small example tabular dataset
│   └── raw/                   # place private training data here (ignored by git)
├── outputs/                   # training outputs (kept empty in git)
├── config.yaml                # main configuration template
├── predict_new_patients.py    # inference script
├── repeat_evaluation.py       # repeated training across random seeds
├── train_final_pipeline.py    # main training and model comparison pipeline
└── requirements.txt

Installation

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

xgboost is optional. If it is not installed, XGBoost candidates are skipped automatically.
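
One common way to make a dependency optional is to check whether the package can be imported before registering the candidates that need it. The sketch below illustrates that pattern only; it is not taken from the repository's code, and `has_module`, `available_candidates`, and the candidate names are hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def available_candidates(requested):
    """Drop candidates whose backing package is missing (hypothetical helper)."""
    # Hypothetical mapping from candidate name to the package it requires.
    requires = {"xgboost": "xgboost"}
    kept = []
    for name in requested:
        package = requires.get(name)
        if package is not None and not has_module(package):
            print(f"Skipping {name}: package '{package}' is not installed")
            continue
        kept.append(name)
    return kept
```

With this pattern, a run on a machine without xgboost still trains every other candidate and reports which ones were skipped.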

Input data

1. Clinical table

Place the main tabular dataset in one of these locations:

data/raw/patients.csv
data/raw/patients.xlsx

The example file included in the repository is:

data/example/patients_example.csv

2. Per-sample VCF files

If you want to include genomic features, place VCF files in:

data/raw/vcfs/

The stem of each VCF filename (the name without the .vcf or .vcf.gz extension) should match the sample identifier stored in the clinical table. Example:

S001.vcf
S002.vcf.gz

The column used to match clinical rows to VCF files is controlled by:

vcf.sample_id_col
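
The matching rule can be sketched as follows. This is a minimal illustration that assumes an exact string match between the identifier and the filename stem; `vcf_stem` and `match_samples_to_vcfs` are hypothetical names, not functions from the repository.

```python
from pathlib import Path


def vcf_stem(path):
    """Return the filename without a trailing .vcf.gz or .vcf extension."""
    name = Path(path).name
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def match_samples_to_vcfs(sample_ids, vcf_paths):
    """Map each sample ID to its VCF path, or None when no file matches."""
    by_stem = {vcf_stem(p): str(p) for p in vcf_paths}
    return {str(s): by_stem.get(str(s)) for s in sample_ids}


if __name__ == "__main__":
    paths = ["data/raw/vcfs/S001.vcf", "data/raw/vcfs/S002.vcf.gz"]
    print(match_samples_to_vcfs(["S001", "S002", "S003"], paths))
```

Samples without a matching file map to None, which makes missing VCFs easy to list before training.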

Binary target definition

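By default, the binary target is derived from the WHO treatment outcome classes (ACPR: adequate clinical and parasitological response; ETF: early treatment failure; LCF: late clinical failure; LPF: late parasitological failure):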

success = ACPR
failure = ETF, LCF, LPF

Rows with any other outcome can be excluded with:

target:
  drop_other_outcomes: true
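
The default mapping can be sketched in a few lines of Python. The sketch makes two assumptions that the text above does not state: failure is encoded as the positive class (1), and, when `drop_other_outcomes` is false, unrecognised outcomes are kept with a label of None. `binarise_outcomes` is a hypothetical helper.

```python
SUCCESS = {"ACPR"}
FAILURE = {"ETF", "LCF", "LPF"}


def binarise_outcomes(outcomes, drop_other_outcomes=True):
    """Return (row_index, label) pairs with 0 = success and 1 = failure.

    Assumption: rows with any other outcome are dropped when
    drop_other_outcomes is True, and kept with label None otherwise.
    """
    labelled = []
    for index, raw in enumerate(outcomes):
        code = str(raw).strip().upper()
        if code in SUCCESS:
            labelled.append((index, 0))
        elif code in FAILURE:
            labelled.append((index, 1))
        elif not drop_other_outcomes:
            labelled.append((index, None))
    return labelled
```

Returning the row index alongside the label keeps the target aligned with the feature table after rows are dropped.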

Main workflow modes

Clinical table only with manual features

feature_sources:
  use_table_features: true
  use_vcf_features: false

feature_selection:
  mode: manual

Clinical table only with automatic feature selection

feature_sources:
  use_table_features: true
  use_vcf_features: false

feature_selection:
  mode: sklearn
  sklearn:
    method: f_classif
    k: 20

Clinical table + VCF features

feature_sources:
  use_table_features: true
  use_vcf_features: true

VCF-only workflow

feature_sources:
  use_table_features: false
  use_vcf_features: true

Feature selection methods available

When feature_selection.mode: sklearn is enabled, the following methods are available:

f_classif — ANOVA F-test ranking
mutual_info — mutual information ranking
variance_threshold — unsupervised low-variance filtering
model_l1 — model-based selection with L1-regularised logistic regression
model_tree — model-based selection with Extra Trees feature importances

Example:

feature_selection:
  mode: sklearn
  exclude_columns: []
  sklearn:
    name: sklearn_auto
    method: model_tree
    k: 20
    variance_threshold: 0.0
    threshold: median
    max_features: 25
    candidate_columns: null

Notes:

k is used by f_classif and mutual_info
variance_threshold is used by variance_threshold
threshold and max_features are mainly relevant for model_l1 and model_tree
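
To make the five method names concrete, the sketch below maps each one to a scikit-learn selector that matches its description. It is an illustration under assumptions, not the repository's implementation: `build_selector` is a hypothetical function, and the estimator settings (liblinear solver, C=1.0, 200 trees) are placeholder choices.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression


def build_selector(method, k=20, variance_threshold=0.0,
                   threshold="median", max_features=25, random_state=0):
    """Return an unfitted selector for one of the documented method names."""
    if method == "f_classif":
        return SelectKBest(f_classif, k=k)
    if method == "mutual_info":
        return SelectKBest(mutual_info_classif, k=k)
    if method == "variance_threshold":
        return VarianceThreshold(threshold=variance_threshold)
    if method == "model_l1":
        # L1 penalty drives some coefficients to exactly zero.
        estimator = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        return SelectFromModel(estimator, threshold=threshold,
                               max_features=max_features)
    if method == "model_tree":
        estimator = ExtraTreesClassifier(n_estimators=200,
                                         random_state=random_state)
        return SelectFromModel(estimator, threshold=threshold,
                               max_features=max_features)
    raise ValueError(f"Unknown feature selection method: {method}")
```

Note that scikit-learn requires k and max_features to be no larger than the number of input columns, so both may need lowering for narrow tables. Fitting the selector inside each cross-validation fold, for example as a Pipeline step, avoids leaking test information into the selection.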

Recommended VCF strategy for Plasmodium

For Plasmodium VCF files, the most practical starting point in this repository is:

use vcf.mode: targeted
use encoding: presence or encoding: allele_fraction
keep only FILTER=PASS variants
require functional annotation when available
focus on biologically relevant genes and coding effects

This is usually preferable to a genome-wide approach when the sample size is limited, because it reduces noise, improves interpretability, and helps control dimensionality.

How VCF files are converted into features

Each VCF is parsed into a tabular representation so it can be combined with clinical variables.

Matching logic

The pipeline links each table row to a VCF file using:

the configured sample identifier column
the VCF filename stem

Supported encodings

presence — binary 0/1 feature per variant
allele_fraction — continuous feature derived from AD/DP, AF, or genotype fallback
dosage — legacy genotype count representation

Annotation-aware filtering

If SnpEff ANN annotations are present, the pipeline can filter by:

target genes
target effects
impact level
presence of annotation
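
A minimal sketch of such a filter is given below. It relies on the standard SnpEff ANN layout, in which each comma-separated entry is pipe-delimited and begins with allele, annotation (effect), impact, gene name, and gene ID. The function names, the parameter names, and the rule that one matching entry is enough to keep a variant are assumptions made for illustration.

```python
ANN_FIELDS = ("allele", "effect", "impact", "gene_name", "gene_id")


def parse_ann(info):
    """Extract SnpEff ANN entries from a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            # zip() keeps only the first five pipe-delimited fields.
            return [dict(zip(ANN_FIELDS, raw.split("|")))
                    for raw in item[4:].split(",")]
    return []


def keep_variant(info, target_genes=None, target_effects=None,
                 impacts=None, require_annotation=True):
    """Return True if at least one ANN entry passes every configured filter."""
    entries = parse_ann(info)
    if not entries:
        return not require_annotation
    for entry in entries:
        # SnpEff joins multiple effects on one entry with '&'.
        effects = set(entry.get("effect", "").split("&"))
        if target_genes and entry.get("gene_name") not in target_genes:
            continue
        if target_effects and not effects & set(target_effects):
            continue
        if impacts and entry.get("impact") not in impacts:
            continue
        return True
    return False
```

Filters left as None are not applied, so the same function covers gene-only, effect-only, and combined filtering.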

Training-time dimensionality control

Even after biological filtering, the training pipeline further controls the VCF feature space by:

keeping only variants observed in at least vcf.min_train_samples_with_variant training samples
capping the number of retained variants with vcf.max_variants
adding a VCF__AVAILABLE indicator feature that records whether a VCF file was available for the sample
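
These three steps can be sketched as follows. The source does not say how variants are ranked when the cap applies, so the sketch assumes the most frequently observed training variants are kept. The variant key format and the VCF__ prefix on per-variant columns are also assumptions; only VCF__AVAILABLE is named above.

```python
from collections import Counter


def select_training_variants(train_variants,
                             min_train_samples_with_variant=2,
                             max_variants=500):
    """Choose variants using training samples only.

    train_variants maps sample ID -> set of variant keys seen in that sample.
    """
    counts = Counter()
    for variants in train_variants.values():
        counts.update(set(variants))
    frequent = [(v, n) for v, n in counts.items()
                if n >= min_train_samples_with_variant]
    # Assumption: keep the most frequent variants; ties are broken by key
    # so that the selection is deterministic.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [v for v, _ in frequent[:max_variants]]


def build_row(sample_variants, kept_variants):
    """Build presence features; sample_variants is None if no VCF was found."""
    present = sample_variants or set()
    row = {f"VCF__{v}": int(v in present) for v in kept_variants}
    row["VCF__AVAILABLE"] = int(sample_variants is not None)
    return row
```

Because the variant list is derived from training samples alone, the same list can be reused unchanged at inference time, and the availability flag lets a model separate "variant absent" from "no VCF provided".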

Training

python train_final_pipeline.py --config config.yaml

Main outputs:

outputs/final_candidate_results.csv
outputs/final_candidate_results_ranked.csv
outputs/final_model_summary.json
outputs/final_model_card.md
artifacts/best_model_bundle.joblib

Repeated evaluation across seeds

python repeat_evaluation.py

This script reruns training across multiple random seeds and stores summary tables in:

outputs/repeat_eval/
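
The kind of per-metric summary such a run produces can be sketched with the standard library alone. The layout of the real summary tables is not described here, so the input structure (one dict of metrics per seed), the metric names, and `summarise_across_seeds` are hypothetical.

```python
from statistics import mean, stdev


def summarise_across_seeds(results):
    """Aggregate per-seed metric dicts into mean, spread, and range.

    results is a list with one dict per seed, for example
    {"seed": 0, "roc_auc": ..., "recall": ...}.
    """
    metrics = sorted({key for row in results for key in row if key != "seed"})
    summary = {}
    for metric in metrics:
        values = [row[metric] for row in results if metric in row]
        summary[metric] = {
            "n": len(values),
            "mean": mean(values),
            # Sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary
```

Reporting the spread alongside the mean shows how much a result depends on the random split, which matters most for small datasets.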

Prediction on new patients

Table-based model

python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --output predictions.tsv

Model that also expects VCF features

python predict_new_patients.py \
  --model-bundle artifacts/best_model_bundle.joblib \
  --input data/raw/new_patients.csv \
  --vcf-dir data/raw/vcfs \
  --output predictions.tsv

For VCF-only models, the input table only needs the sample identifier column.

Example configuration files

Example YAML files are provided in:

configs/examples/manual_table_only.yaml
configs/examples/sklearn_model_tree_table_only.yaml
configs/examples/table_plus_targeted_vcf.yaml
configs/examples/vcf_only_targeted.yaml

Public repository notes

This repository has been prepared for public release with the following principles:

code comments and documentation are in English
IDE files, cached files, private raw data, generated artifacts, and outputs are ignored by git
the repository ships with an example dataset structure, not private patient data
configuration is template-based so users can adapt it to their own data

Limitations

Small datasets can make performance estimates unstable.
VCF-based modelling can become high-dimensional very quickly.
Targeted genomic filtering is often more realistic than genome-wide modelling for modest sample sizes.
Clinical deployment would require independent validation, calibration review, and governance approval.

Acknowledgements

This repository was developed in the context of the MAL-PREDICT project and related work on machine learning support for malaria treatment failure prediction in Angola.
