Mycotoxins are toxins produced by fungi that colonize food crops and can cause serious economic burden to the US corn industry. Aflatoxins may occur in food because of mold growth in susceptible raw agricultural products. The growth of molds that produce aflatoxins is influenced by environmental factors such as temperature, humidity, and extent of rainfall during the pre-harvesting, harvesting, or post-harvesting periods (FDA). It has been estimated that aflatoxin contamination could cause losses ranging from $52.1 million to $1.68 billion annually in the United States (Castano-Duque et al). It has been proven that healthier plants are less at risk. Also, plants are more at risk in hot and dry conditions. Pre-and post-planting weather factors are significantly correlated with Aflatoxin and Fumonisin contamination at harvest which is why I gathered weather data from six months before and after the planting dates given in my datasets. Aflatoxins are one of the most problematic because of their carcinogenic and mutagenic properties that can contaminate food and feed. The FDA says that human food containing total aflatoxins greater than 20 micrograms per kilogram, or ppb may be injurious to health. Consumption of food with high levels of aflatoxins is also associated with liver cancer in humans. For my Capstone project I trained a model to predict the Aflatoxin contamination at harvest using weather data. In Capstone 1 I created a baseline model and a linear regression model.
Source:
https://agpmt.org/data-management/
Type:
CSV
Size:
(1048576, 37)
Key features:
Location, NOLA_ID, Planting_Date, Harvest_Date, GPS, Temperature, Precipiation, NDVI
I had to make sure each row had a PLanting_Date, Harvest_Date, and GPS coordinates otherwise those rows were not usefull. I collected the weather data hourly at first and then daily and weekly.
The only weather feature with missing values was NDVI since I used a quailty assurance mask to only use values that were valid. This means when it was very cloudy or there was snow that NDVI was not included in my data.
GPS was split into Latitude and Longitude. Aflatoin Risk Index code was used in R to calculate that weekly value as well as features like Growing Degree Days and Cumulative Growing Degree Days.
Locations of GPS coordinated for each year.
Aflatoxin contamination results at harvest.
Example of NDVI.
Confustion matrix with only weather features.
Confusion Matrix with ARI features. The correlation matrix shows that no single variable strongly explains aflatoxin contamination.
However, strong correlations exist between engineered features such as GDD, growth variables, and ARI, indicating redundancy.
This suggests that aflatoxin prediction depends on nonlinear interactions between environmental and biological factors, rather than simple linear relationships.
Baseline model: Random Forest using GroupKFold with 5 folds by Location. I used grouped cross-validation by location to avoid spatial leakage. This ensures the model is tested on locations it has not seen before. Results are an RMSE of 2.30, MAE of 0.64, and an R² of -0.09. The negative R² suggests that predicting aflatoxin across different locations is challenging, likely due to strong environmental and regional differences.
One advanced model:
I moved on to a two-stage approach. The first stage being a classification model to determine contamination or no contamination. Stage two is a regression mode to preditc aflatoxin amount only for those contaminated cases.
Why I chose them:
It is important to note that my dataset is about 90% zeroes in the aflatoxin target, meaning 90% of the crops were not contaminated. When I evaluated the baseline random forest model specifically on non-zero aflatoxin cases, the error increased substantially. This shows that while the model handles the majority of zero cases well, it struggles to predict actual contamination events. This highlights the need for a different modeling strategy, the two-stage approach.
Tools used
Hyperparameters:
Starting at a default threshold of 0.5 with the regression model gave me
RMSE: 1.6887343388777052
MAE: 1.6695052547403957
R²: -4.507769503825259e+30
When I changed the threshold to 0.35, which is appropriate for this project where missing contamination is more hurtful than false positives. My results then greatly improved and my results overall were.
RMSE: 0.8799
MAE : 0.2648
R² : 0.3875
And for non-zero only:
RMSE: 1.6258
MAE : 0.8014
R² : 0.2799
To evaluate the models, I used both classification and regression metrics because the project has two goals. First, predicting whether aflatoxin contamination occurs then estimating the aflatoxin level.
For the classification model I used:
Accuracy: Measures the overall percentage of correct predictions. However, because aflatoxin data can be imbalanced, accuracy alone is not enough.
Precision: Measures how many predicted contamination cases were actually contaminated. This is important because false alarms could lead to unnecessary concern or extra testing.
Recall: Measures how many true contamination cases the model successfully identified. This is especially important because missing a contaminated sample could have serious agricultural and food safety consequences.
F1-score: Balances precision and recall, making it useful when the dataset has uneven classes.
ROC-AUC: Measures how well the model separates contaminated and non-contaminated samples across different thresholds.
For the regression model, I used:
MAE: Measures the average size of prediction errors. It is easy to interpret because it is in the same general scale as the target variable.
RMSE: Penalizes larger errors more strongly, which is useful because very high aflatoxin levels are especially important to predict.
R²: Shows how much variation in aflatoxin levels the model explains.
Baseline Models Comparison Table:

Model interpretation was performed using feature importance and SHAP values. These methods helped identify which variables had the strongest influence on aflatoxin prediction.
Important features included variables such as NDVI, temperature, precipitation, relative humidity, GDD, cumulative GDD, and days from planting. These features are biologically meaningful because aflatoxin contamination is affected by plant stress, weather conditions, crop development, and environmental moisture.
SHAP plots were especially useful because they showed not only which features mattered most, but also how high or low values of those features influenced the model predictions.

The best performing models were tree based machine learning models such as Random Forest, XGBoost, or LightGBM because they can capture nonlinear relationships between weather, vegetation indices, crop timing, and aflatoxin risk. Aflatoxin prediction is difficult because contamination is affected by many interacting biological and environmental factors. The dataset also contains many zero or contamination samples, which makes prediction more challenging. The practical impact of this project is that it could help identify fields or locations with higher aflatoxin risk before harvest. This can support better monitoring, targeted testing, and earlier decision-making for crop management and food safety.
In this project I developed a machine learning workflow to predict aflatoxin contamination using environmental, remote sensing, and agronomic data. Multiple models were compared using classification and regression metrics. The results showed that tree-based models were useful for identifying important risk factors and predicting contamination patterns. Overall, the project demonstrates how data science, remote sensing, and agricultural data can be combined to support aflatoxin risk prediction and help farmers.
Future improvements could include testing more advanced models such as neural networks. Also xpanding SHAP interpretation and editing models based on that information. Deploying the final model in a Streamlit app for farmers is a final step that would require exploring predicting weather features.
Install dependencies
Run preprocessing
Train model
Evaluate results
AggregateMetadata.ipynb - Join the data from 2023 and 2024 keeping in mind that GPS locations change each each even if Location matches.
AddWeatherNDVI.ipynb - Get weather features using point like temperature, precipitation, humidity and others frm the NLDAS satellite.
CountyWeather.ipynb - Get weather features using country like temperature, precipitation, humidity and others frm the NLDAS satellite. County is for anonymity and to check that there is not weather phenomenons occuring at point locations.
Sanity_checks.ipynb - Check for missing values and graph weather to features to check for following of seasons. For example, hottest in summer and coldest in winter. Higher precipition and NDVI in spring.
TwoStageMarch29.ipynb - Two Stage model
pandas numpy scikit-learn matplotlib seaborn joblib streamlit shap xgboost lightgbm earthengine-api geemap geopandas