This project involves developing a predictive model for length of stay (LOS) using MIMIC-IV. The MIMIC-IV dataset (Medical Information Mart for Intensive Care IV) is a publicly available dataset that contains comprehensive and de-identified health-related data for critical care patients. It is widely used in the medical and machine learning communities for research in healthcare, particularly in areas like predictive modeling, clinical decision support, and patient outcome analysis. The goal is to enable healthcare providers to allocate resources more efficiently and manage patient care effectively.
The primary objective of this project is to create a predictive model capable of identifying patients who are likely to stay in the hospital for more than seven days. Accurate predictions of LOS allow healthcare systems to optimize resource allocation, improve operational efficiency, and enhance patient outcomes.
The original dataset consists of 100,000 patient records and includes 28 features. It contains comprehensive information about patients, such as:
- Medical history
- Number of previous admissions
- Diagnostic codes
- Laboratory test results
These features provide the foundation for building and validating predictive models for LOS.
The dataset was notably clean, containing no missing values. This eliminated the need for imputation or other data-filling techniques and allowed us to focus on feature engineering and modeling.
To prepare the dataset for modeling, the following transformations were applied:
- Encoding categorical variables: The gender variable was encoded to a numerical format to ensure compatibility with machine learning algorithms.
- Standardizing continuous variables: Continuous laboratory measurements were standardized to ensure that all features were on a comparable scale, improving model convergence and performance.
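The two transformations above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's actual preprocessing code: the column names (`creatinine`, `glucose`), the sample values, and the `F`/`M` coding are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Hypothetical records; column names and values are illustrative only.
records = [
    {"gender": "F", "creatinine": 1.1, "glucose": 140.0},
    {"gender": "M", "creatinine": 0.9, "glucose": 98.0},
    {"gender": "M", "creatinine": 1.6, "glucose": 210.0},
    {"gender": "F", "creatinine": 1.2, "glucose": 120.0},
]

GENDER_CODES = {"F": 0, "M": 1}  # assumed binary coding


def encode_gender(rows):
    """Return copies of the rows with gender mapped to an integer code."""
    return [{**row, "gender": GENDER_CODES[row["gender"]]} for row in rows]


def standardize(rows, columns):
    """Z-score the given columns: subtract the mean, divide by the std dev."""
    out = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in out:
            # Constant columns have zero variance; map them to 0.0.
            row[col] = (row[col] - mu) / sigma if sigma > 0 else 0.0
    return out


prepared = standardize(encode_gender(records), ["creatinine", "glucose"])
```

When a train/test split is used, the mean and standard deviation would normally be computed on the training rows only and then reused for the test rows, so that no information from the test set leaks into the features.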
Feature engineering was conducted to refine the dataset and improve the predictive power of the model:
- Creation of the `number_of_issues` feature: A new feature was derived to represent the total number of medical conditions a patient had. This helped to consolidate information and reduced the number of input features by 11, simplifying the model.
- Feature selection: Unnecessary features that did not significantly contribute to predicting LOS were removed.
- Target variable transformation: A binary target variable was created to represent whether a patient’s LOS was greater than 7 days. This binary classification approach aligns with the project objective.
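As a rough sketch of the two derived columns, the snippet below sums a set of 0/1 condition indicators into `number_of_issues` and thresholds LOS into a binary label. The indicator names, the `length_of_stay` column, and the `long_stay` label name are hypothetical, since the original columns are not listed here. The comparison is strictly greater than 7 days, following the target description; it would become `>=` if seven-day stays should count as long.

```python
# Hypothetical indicator columns; the real dataset's names are not given here.
CONDITION_FLAGS = ["asthma", "diabetes", "hypertension"]
LOS_THRESHOLD_DAYS = 7


def add_number_of_issues(row, flags=CONDITION_FLAGS):
    """Replace the individual 0/1 condition flags with their sum."""
    out = {key: value for key, value in row.items() if key not in flags}
    out["number_of_issues"] = sum(row[flag] for flag in flags)
    return out


def add_long_stay_target(row, threshold=LOS_THRESHOLD_DAYS):
    """Add a binary label: 1 if LOS exceeds the threshold, else 0."""
    out = dict(row)
    out["long_stay"] = int(row["length_of_stay"] > threshold)
    return out


patient = {"asthma": 1, "diabetes": 0, "hypertension": 1, "length_of_stay": 9}
engineered = add_long_stay_target(add_number_of_issues(patient))
```

The raw LOS column is kept here only to derive the label; it would have to be excluded from the model inputs, since it encodes the target directly.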
Multiple machine learning models were tested to identify the best-performing approach for LOS prediction. Key steps included:
- The dataset was split into training (80%) and testing (20%) subsets to evaluate model performance.
- A holdout validation set was considered to ensure robust evaluation.
- 5-Fold Cross-Validation: This was employed to assess model stability and detect overfitting.
- Random Search: Hyperparameters were tuned using random search for each model type, optimizing performance metrics.
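The evaluation workflow above (80/20 split, 5-fold cross-validation, random search) can be outlined without committing to a particular library. The sketch below is an assumption-laden illustration: the search space values are invented, and `score_fn` is a hypothetical callback that trains a model with the given hyperparameters on one fold and returns a validation score such as ROC-AUC.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical search space for a gradient boosting model.
SEARCH_SPACE = {
    "n_estimators": [100, 200, 400],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 5],
}


def random_search(score_fn, train_indices, n_iter=10, k=5, seed=0):
    """Sample hyperparameter settings and keep the best mean CV score.

    score_fn(params, train_idx, val_idx) must return a score where
    higher is better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(train_indices, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Cross-validation runs only on the training indices, so the 20% test split stays untouched until the final evaluation. With an imbalanced target, a stratified split that preserves the class ratio in each fold is a common refinement; it is omitted here for brevity.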
The following models were explored:
- Logistic Regression
- Support Vector Machines (SVM)
- Random Forest
- Gradient Boosting Machine (GBM)
On this imbalanced dataset, tree-based methods outperformed the other models, which we attribute to their ability to capture complex feature interactions and their robustness to class imbalance. Evaluation metrics included:
- ROC-AUC Score: This was used as the primary metric to assess model performance.

Gradient Boosting Machine (GBM) emerged as the best-performing model, with the highest ROC-AUC score of the models tested, demonstrating its ability to make accurate predictions even with an imbalanced dataset.
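To make the primary metric concrete, ROC-AUC can be read as the probability that a randomly chosen positive case (a long stay) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made-up illustrative values, not project results.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 3 of the 4 positive/negative pairs are ordered correctly

# A model that gives every patient the same score is 90% "accurate" on a
# 9:1 imbalanced sample if it always predicts the majority class, yet its
# ROC-AUC is only 0.5.
constant_auc = roc_auc([0] * 9 + [1], [0.0] * 10)
```

Because it depends only on the ranking of scores, ROC-AUC is not inflated by always predicting the majority class, which is one reason it suits an imbalanced target. The pairwise loop is quadratic and is meant for illustration; on a dataset of this size a rank-based or library implementation would be used instead.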
To enhance the interpretability of the best-performing model, we employed two popular explainability techniques, SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations):
SHAP values were used to quantify the contribution of each feature to the model's predictions. This approach provided both global and local understanding of the model. Key insights included:
- Features such as `number_of_issues` and specific lab results had the most significant impact on predictions.
- The visualization of SHAP values helped identify critical thresholds and patterns in the data.
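The idea behind SHAP values can be illustrated with an exact Shapley computation on a tiny model. The sketch below averages each feature's marginal contribution over every feature ordering, with features not yet "revealed" held at a baseline value. The `toy_model`, its coefficients, and the baseline are hypothetical; this is not the project's GBM, and the SHAP library relies on much faster model-specific or approximate algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by averaging over all feature orderings.

    Features that have not been revealed yet take their baseline value.
    The cost grows factorially, so this only suits a handful of features.
    """
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = instance[feature]
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def toy_model(x):
    """Hypothetical risk score with an interaction term (illustration only)."""
    number_of_issues, lab_value = x
    return 0.5 * number_of_issues + 0.2 * lab_value + 0.1 * number_of_issues * lab_value


instance = [3.0, 2.0]   # the patient being explained
baseline = [0.0, 0.0]   # reference point, e.g. an "average" patient
phi = shapley_values(toy_model, instance, baseline)
```

The contributions always sum to the difference between the model output for the instance and for the baseline, which is what lets per-patient (local) values be aggregated into a global feature ranking.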
LIME was applied to create interpretable explanations for individual predictions. It was particularly useful in understanding why certain patients were predicted to have an LOS greater than 7 days. This technique allowed us to:
- Validate the model’s behavior on edge cases.
- Provide actionable insights to healthcare providers by highlighting specific features driving predictions.
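The core mechanism of LIME is to perturb the inputs around one patient, weight the perturbed samples by their proximity to that patient, and fit a simple linear model to the black-box outputs. The sketch below follows that recipe in plain Python. It is written in the spirit of LIME and is not the `lime` package's implementation; `toy_risk`, the Gaussian perturbation, and the kernel settings are assumptions made for the example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def local_surrogate(black_box, instance, n_samples=500, scale=1.0,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around one instance.

    Returns [intercept, coef_1, ..., coef_d] of the local surrogate.
    """
    rng = random.Random(seed)
    d = len(instance)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Perturb the instance and weight the sample by its closeness to it.
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        distance_sq = sum((s - v) ** 2 for s, v in zip(sample, instance))
        weight = math.exp(-distance_sq / kernel_width ** 2)
        row = [1.0] + sample
        target = black_box(sample)
        # Accumulate the weighted normal equations X^T W X and X^T W y.
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * target
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)


def toy_risk(x):
    """Hypothetical black-box probability of a long stay (illustration only)."""
    number_of_issues, lab_value = x
    return 1.0 / (1.0 + math.exp(-(0.8 * number_of_issues + 0.3 * lab_value - 2.0)))


intercept, *coefficients = local_surrogate(toy_risk, [3.0, 1.0])
```

The sign and size of each coefficient indicate how that feature pushes the prediction for this particular patient, which is the kind of per-case explanation that can be shown to a clinician.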
These explainability tools helped make the GBM model’s predictions more transparent, making it easier for stakeholders to assess, trust, and adopt the model in clinical settings.
This project successfully developed a predictive model for hospital length of stay, providing actionable insights to improve healthcare management. By leveraging advanced machine learning techniques and rigorous evaluation, the GBM model offers a reliable tool for predicting extended patient stays, enabling better resource planning and patient care.