Analysis of tomato quality using Near-Infrared Spectroscopy (NIR)
- Features
- Project Structure
- Installation
- Usage
- Experiment Tracking with MLflow
- Testing
- License
Pre-processing of NIR spectral data:
- Spectral transformations (SNV, MSC)
- Savitzky-Golay filtering
- Automatic detection and filtering of non-numeric columns
- Outlier detection and removal
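To make the two scatter corrections listed above concrete, here is a minimal NumPy sketch of SNV and MSC. The function names `snv` and `msc` are hypothetical and are not the package's `SNVTransformer` or any other class in `transformers.py`, whose internals this README does not show. It assumes spectra are stored one per row.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each row is regressed on the reference (row ~ intercept + slope * ref),
    then the fitted offset and gain are removed. The mean spectrum is used
    as the reference when none is given.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, deg=1)  # highest power first
        corrected[i] = (row - intercept) / slope
    return corrected


# Synthetic demo: one underlying band shape seen with three gains and offsets.
base = np.sin(np.linspace(0, 3, 50)) + 2.0
gains = np.array([[0.8], [1.0], [1.3]])
offsets = np.array([[0.1], [0.0], [-0.2]])
demo = gains * base + offsets

snv_out = snv(demo)  # every row now has mean 0 and unit standard deviation
msc_out = msc(demo)  # every row collapses onto the mean spectrum
```

Because the demo rows differ only by gain and offset, both corrections map them onto a single curve. Real spectra also differ chemically, so some spread remains after correction.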
Modeling of NIR data:
- PLS regression
- Support Vector Regression (SVR)
- Random Forest regression
- XGBoost regression
- LightGBM regression
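Whichever regressor is chosen, the usage examples later in this README score it with R², RMSE and MAE on a held-out set. As a rough reference for what those three numbers mean, here is a NumPy-only sketch. `regression_metrics` is a hypothetical helper, not the package's `evaluate_regression_model`, and the Brix values are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    ss_res = np.sum(residuals**2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean())**2)    # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals**2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Made-up Brix values, purely illustrative.
example = regression_metrics([4.0, 5.0, 6.0, 7.0], [4.5, 5.0, 5.5, 7.0])
print(example)
```

RMSE and MAE are in the units of the target (here, degrees Brix), while R² is unitless and compares the model against always predicting the mean.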
Advanced model optimization:
- Hyperparameter tuning with Optuna
- Feature selection methods (Genetic Algorithm, CARS, VIP)
- Integrated cross-validation
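Of the feature selection methods listed, VIP (Variable Importance in Projection) is the easiest to show in a few lines. The sketch below fits a single-response PLS model with the NIPALS algorithm and derives VIP scores from it. It is an illustration under stated assumptions, not the code in `feature_selection.py`: `pls1_nipals` and `vip_scores` are hypothetical names, and the data are synthetic.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Fit single-response PLS with NIPALS.

    Returns the weight matrix W (features x components), the score matrix T
    (samples x components) and the y-loadings q (one per component).
    """
    Xr = np.asarray(X, dtype=float)
    yr = np.asarray(y, dtype=float)
    Xr = Xr - Xr.mean(axis=0)
    yr = yr - yr.mean()
    n, p = Xr.shape
    W = np.zeros((p, n_components))
    T = np.zeros((n, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)            # unit-length weight vector
        t = Xr @ w                        # scores for this component
        tt = t @ t
        p_load = Xr.T @ t / tt            # X-loadings, used only for deflation
        q[a] = yr @ t / tt                # y-loading
        Xr = Xr - np.outer(t, p_load)     # deflate X
        yr = yr - q[a] * t                # deflate y
        W[:, a], T[:, a] = w, t
    return W, T, q


def vip_scores(W, T, q):
    """VIP per feature: weights squared, averaged by y-variance explained."""
    p = W.shape[0]
    ss = q**2 * np.sum(T**2, axis=0)              # y-variance explained per component
    w_sq = (W / np.linalg.norm(W, axis=0))**2     # normalised squared weights
    return np.sqrt(p * (w_sq @ ss) / ss.sum())


# Synthetic data: only columns 3 and 17 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.1 * rng.normal(size=60)

W, T, q = pls1_nipals(X, y, n_components=3)
vip = vip_scores(W, T, q)
selected = np.argsort(vip)[::-1][:5]   # indices of the 5 highest-VIP features
```

By construction the squared VIP scores average to 1, which is why a threshold of 1 is a common rule of thumb. Keeping a fixed number of top-ranked features, as above, is the other usual choice.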
Experiment tracking with MLflow:
- Parameter logging
- Metrics tracking
- Model artifacts storage
- Feature importance visualization
Quality assurance:
- Comprehensive test suite
- Code quality checks with ruff and black
NIRS/
├── configs/  # Configuration files for experiments
│   ├── pls_snv_savgol.yaml  # PLS model with SNV and Savitzky-Golay
│   ├── rf_msc_feature_selection.yaml  # Random Forest with feature selection
│   ├── xgb_genetic_algorithm.yaml  # XGBoost with genetic algorithm
│   ├── rf_hyperparams_tuning.yaml  # Random Forest with hyperparameter tuning
│   └── README.md  # Documentation for config files
├── data/  # Data directory
│   ├── raw/  # Raw input data files
│   │   └── Tomato_Viavi_Brix_model_pulp.csv  # Example tomato NIR spectra dataset
│   └── processed/  # Processed data files
├── experiments/  # Experiment scripts
│   ├── analyze_models.py  # Script for analyzing model performance
│   ├── create_config.py  # Tool for creating experiment configs
│   ├── experiment_manager.py  # Manages experiment execution
│   ├── process_data.py  # Data processing utilities
│   ├── process_tomato_data.py  # Tomato-specific data processing
│   ├── run_experiment.py  # Main experiment runner
│   ├── run_from_config.py  # Run experiments from config files
│   ├── run_experiments.py  # Extended experiment runner
│   ├── run_mlflow_server.py  # MLflow server launcher
│   └── train_model.py  # Model training script
├── images/  # Images for documentation
│   └── tomato.png  # Visualization of tomato NIR analysis
├── mlruns/  # MLflow experiment tracking data
├── models/  # Saved model files
├── nirs_tomato/  # Main package
│   ├── config.py  # Configuration utilities
│   ├── data_processing/  # Data processing modules
│   │   ├── constants.py  # Constant definitions
│   │   ├── feature_selection.py  # Feature selection methods
│   │   ├── pipeline/  # Pipeline implementations
│   │   ├── transformers.py  # Spectral transformers
│   │   └── utils.py  # Utility functions
│   ├── modeling/  # Modeling and evaluation modules
│   │   ├── evaluation.py  # Model evaluation tools
│   │   ├── hyperparameter_tuning.py  # Hyperparameter optimization
│   │   ├── model_factory.py  # Model creation factory
│   │   ├── regression_models.py  # Regression model implementations
│   │   └── tracking.py  # MLflow experiment tracking
│   └── __init__.py  # Package initialization
├── results/  # Experiment results and outputs
├── tests/  # Test files
│   ├── test_data_processing/  # Tests for data processing
│   └── test_modeling/  # Tests for modeling
├── .gitignore  # Git ignore file
├── .flake8  # Flake8 configuration
├── pyproject.toml  # Project configuration
└── README.md  # This file
Clone this repository and install the package using pip:
git clone https://github.com/yourusername/NIRS.git
cd NIRS
pip install -e ".[dev]"

Here's a quick example to get you started with analyzing tomato NIR spectra:
from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
import pandas as pd
from sklearn.model_selection import train_test_split
# 1. Load your NIR data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')
# 2. Process the spectral data
results = preprocess_spectra(
df=df,
target_column='Brix',
transformers=[SNVTransformer()],
exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
remove_outliers=False,
verbose=True
)
# 3. Get processed features and target
X, y = results['X'], results['y']
# 4. Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5. Create and train a model
model = create_model("pls", n_components=10)
model.fit(X_train, y_train)
# 6. Evaluate the model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print("Model performance:")
print(f" R² score: {metrics['r2']:.4f}")
print(f" RMSE: {metrics['rmse']:.4f}")
print(f" MAE: {metrics['mae']:.4f}")

The package provides tools for data processing, including transformers for spectral data, utilities for data cleaning, and pipelines for complete data processing workflows.
from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
import pandas as pd
# Load data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')
# Process data
results = preprocess_spectra(
df=df,
target_column='Brix',
transformers=[SNVTransformer()],
exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
remove_outliers=False,
verbose=True
)
# Get processed features and target
X = results['X']
y = results['y']

The package provides command-line scripts for model training, as well as Python functions for creating and evaluating models.
# Train a PLS model with SNV transformation
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model pls --transform snv
# Train an XGBoost model with MSC transformation, Savitzky-Golay filtering, and hyperparameter tuning
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model xgb --transform msc --savgol --window_length 15 --polyorder 2 --tune_hyperparams
# Train a Random Forest model with feature selection
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model rf --transform snv --feature_selection vip --n_features 20

from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create model
model = create_model("pls", n_components=10)
# Train model
model.fit(X_train, y_train)
# Evaluate model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print(f"R2 score: {metrics['r2']:.4f}")

The package provides multiple ways to run experiments:
The simplest way to run experiments is using YAML configuration files:
# Run a single experiment from a config file
python experiments/run_from_config.py --config configs/pls_snv_savgol.yaml
# Run all experiments in the configs directory
python experiments/run_from_config.py --config_dir configs/
# Run an experiment with verbose output
python experiments/run_from_config.py --config configs/rf_msc_feature_selection.yaml

You can create your own experiment configuration files using the create_config.py script:
# Create a new configuration file
python experiments/create_config.py --name my_experiment --data_path data/raw/Tomato_Viavi_Brix_model_pulp.csv --target_column Brix --model rf --transform snv --output configs/my_experiment.yaml

You can also run experiments programmatically:
from experiments.experiment_manager import ExperimentManager
from nirs_tomato.config import ExperimentConfig
# Create experiment manager
manager = ExperimentManager()
# Run from config file
results = manager.run_from_config("configs/pls_snv_savgol.yaml")
# Or create a config object programmatically
config = ExperimentConfig.from_yaml("configs/pls_snv_savgol.yaml")
config.model.model_type = "rf" # Change model type
config.data.transform = "msc" # Change transformation
# Run with modified config
results = manager.run_from_config_object(config)

The package includes functions for visualizing results:
import matplotlib.pyplot as plt
import numpy as np
# Plot regression results
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Brix')
plt.ylabel('Predicted Brix')
plt.title('Predicted vs Actual Brix Values')
plt.tight_layout()
plt.savefig("results/regression_plot.png")

The package integrates with MLflow for experiment tracking, which helps you organize and compare your experiments.
MLflow is included in the project dependencies. To start the MLflow tracking server:
# Start MLflow server with local storage
python experiments/run_mlflow_server.py --host 0.0.0.0 --port 5000
# Start with custom backend store
python experiments/run_mlflow_server.py --backend-store-uri sqlite:///mlflow.db

To enable MLflow tracking in your experiments, set mlflow.enabled: true in your configuration file:
# In your YAML config file
mlflow:
  enabled: true
  experiment_name: nirs-tomato-brix

Or enable it programmatically:
from nirs_tomato.modeling.tracking import start_run, log_parameters, log_metrics, log_model, end_run
# Start a run
with start_run(experiment_name="nirs-tomato-test"):
    # Log parameters
    log_parameters({"model_type": "pls", "n_components": 10})
    # Train your model
    # ...
    # Log metrics
    log_metrics({"r2": 0.95, "rmse": 0.05})
    # Log model
    log_model(model, "pls_model")

Once your MLflow server is running, you can view your experiments in the MLflow UI:
- Open your browser and navigate to http://localhost:5000 (or your custom host:port)
- Browse experiments by name
- Compare runs, view metrics, and download models
Run the test suite using pytest:
# Run all tests
pytest
# Run with coverage report
pytest --cov=nirs_tomato
# Run specific test module
pytest tests/test_data_processing/test_transformers.py

This project uses GitHub Actions for continuous integration. The CI pipeline:
- Runs tests on multiple Python versions
- Ensures code quality with ruff and other linters
- Generates coverage reports
This project is licensed under the MIT License.
Built for tomato quality analysis using NIR spectroscopy
