
🍅 NIRS - NIR Spectroscopy Analysis for Tomatoes

Analysis of tomato quality using Near-Infrared Spectroscopy (NIR)

✨ Features

  • 🔍 Pre-processing of NIR spectral data:

    • Spectral transformations (SNV, MSC)
    • Savitzky-Golay filtering
    • Automatic detection and filtering of non-numeric columns
    • Outlier detection and removal
  • 🧠 Modeling of NIR data:

    • PLS regression
    • Support Vector Regression (SVR)
    • Random Forest regression
    • XGBoost regression
    • LightGBM regression
  • 📊 Advanced model optimization:

    • Hyperparameter tuning with Optuna
    • Feature selection methods (Genetic Algorithm, CARS, VIP)
    • Integrated cross-validation
  • 📈 Experiment tracking with MLflow:

    • Parameter logging
    • Metrics tracking
    • Model artifacts storage
    • Feature importance visualization
  • 🧪 Quality assurance:

    • Comprehensive test suite
    • Code quality checks with ruff and black

📂 Project Structure

NIRS/
├── configs/                    # Configuration files for experiments
│   ├── pls_snv_savgol.yaml     # PLS model with SNV and Savitzky-Golay
│   ├── rf_msc_feature_selection.yaml # Random Forest with feature selection
│   ├── xgb_genetic_algorithm.yaml    # XGBoost with genetic algorithm
│   ├── rf_hyperparams_tuning.yaml    # Random Forest with hyperparameter tuning
│   └── README.md               # Documentation for config files
├── data/                       # Data directory
│   ├── raw/                    # Raw input data files
│   │   └── Tomato_Viavi_Brix_model_pulp.csv # Example tomato NIR spectra dataset
│   └── processed/              # Processed data files
├── experiments/                # Experiment scripts
│   ├── analyze_models.py       # Script for analyzing model performance
│   ├── create_config.py        # Tool for creating experiment configs
│   ├── experiment_manager.py   # Manages experiment execution
│   ├── process_data.py         # Data processing utilities
│   ├── process_tomato_data.py  # Tomato-specific data processing
│   ├── run_experiment.py       # Main experiment runner
│   ├── run_from_config.py      # Run experiments from config files
│   ├── run_experiments.py      # Extended experiment runner
│   ├── run_mlflow_server.py    # MLflow server launcher
│   └── train_model.py          # Model training script
├── images/                     # Images for documentation
│   └── tomato.png              # Visualization of tomato NIR analysis
├── mlruns/                     # MLflow experiment tracking data
├── models/                     # Saved model files
├── nirs_tomato/                # Main package
│   ├── config.py               # Configuration utilities
│   ├── data_processing/        # Data processing modules
│   │   ├── constants.py        # Constant definitions
│   │   ├── feature_selection.py # Feature selection methods
│   │   ├── pipeline/           # Pipeline implementations
│   │   ├── transformers.py     # Spectral transformers
│   │   └── utils.py            # Utility functions
│   ├── modeling/               # Modeling and evaluation modules
│   │   ├── evaluation.py       # Model evaluation tools
│   │   ├── hyperparameter_tuning.py # Hyperparameter optimization
│   │   ├── model_factory.py    # Model creation factory
│   │   ├── regression_models.py # Regression model implementations
│   │   └── tracking.py         # MLflow experiment tracking
│   └── __init__.py             # Package initialization
├── results/                    # Experiment results and outputs
├── tests/                      # Test files
│   ├── test_data_processing/   # Tests for data processing
│   └── test_modeling/          # Tests for modeling
├── .gitignore                  # Git ignore file
├── .flake8                     # Flake8 configuration
├── pyproject.toml              # Project configuration
└── README.md                   # This file

🚀 Installation

Clone this repository and install the package using pip:

git clone https://github.com/caitlon/NIRS.git
cd NIRS
pip install -e ".[dev]"

📊 Usage

Quick Start

Here's a quick example to get you started with analyzing tomato NIR spectra:

from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load your NIR data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')

# 2. Process the spectral data
results = preprocess_spectra(
    df=df,
    target_column='Brix',
    transformers=[SNVTransformer()],
    exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
    remove_outliers=False,
    verbose=True
)

# 3. Get processed features and target
X, y = results['X'], results['y']

# 4. Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Create and train a model
model = create_model("pls", n_components=10)
model.fit(X_train, y_train)

# 6. Evaluate the model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print("Model performance:")
print(f"  R² score: {metrics['r2']:.4f}")
print(f"  RMSE: {metrics['rmse']:.4f}")
print(f"  MAE: {metrics['mae']:.4f}")

Data Processing

The package provides tools for data processing, including transformers for spectral data, utilities for data cleaning, and pipelines for complete data processing workflows.

from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
import pandas as pd

# Load data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')

# Process data
results = preprocess_spectra(
    df=df,
    target_column='Brix',
    transformers=[SNVTransformer()],
    exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
    remove_outliers=False,
    verbose=True
)

# Get processed features and target
X = results['X']
y = results['y']
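For readers unfamiliar with the SNV and MSC transformations listed under Features, here is a minimal NumPy sketch of both. It illustrates the techniques only and is not the package's `SNVTransformer` implementation; the function names `snv` and `msc` and the toy spectra are assumptions made for this example.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum ~ slope * ref + intercept, then undo both effects.
        slope, intercept = np.polyfit(ref, spectrum, deg=1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected


# Toy data (made up): the second spectrum is the first one with a
# multiplicative scatter effect (x2) and an additive baseline offset (+1).
base = np.array([0.10, 0.30, 0.50, 0.20])
spectra = np.vstack([base, 2.0 * base + 1.0])

# After either correction the two spectra coincide.
print(np.allclose(snv(spectra)[0], snv(spectra)[1]))
print(np.allclose(msc(spectra)[0], msc(spectra)[1]))
```

Note that this `snv` divides by the per-spectrum standard deviation, so a perfectly flat spectrum would need special handling.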

Model Training

The package provides command-line scripts for model training, as well as Python functions for creating and evaluating models.

Using the command-line script

# Train a PLS model with SNV transformation
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model pls --transform snv

# Train an XGBoost model with MSC transformation and Savitzky-Golay filtering
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model xgb --transform msc --savgol --window_length 15 --polyorder 2 --tune_hyperparams

# Train a Random Forest model with feature selection
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model rf --transform snv --feature_selection vip --n_features 20
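The `--window_length 15 --polyorder 2` options above are the two standard parameters of a Savitzky-Golay filter: a polynomial of the given order is fitted by least squares inside each sliding window, and the fitted value at the window centre replaces the original point. The NumPy sketch below shows that idea under simplifying assumptions (edges are dropped rather than padded). The helper names are hypothetical, and this README does not state how the package itself implements the filter; `scipy.signal.savgol_filter` is the commonly used library routine.

```python
import numpy as np


def savgol_weights(window_length, polyorder):
    """Smoothing weights for the centre point of a Savitzky-Golay window."""
    if window_length % 2 == 0 or polyorder >= window_length:
        raise ValueError("window_length must be odd and larger than polyorder")
    half = window_length // 2
    offsets = np.arange(-half, half + 1)
    # Design matrix with columns offsets**0, offsets**1, ..., offsets**polyorder.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Least squares: coefficients = pinv(design) @ window_values.
    # The fitted value at offset 0 is coefficient 0, so the first row
    # of the pseudo-inverse holds the smoothing weights.
    return np.linalg.pinv(design)[0]


def savgol_smooth(spectrum, window_length=15, polyorder=2):
    """Smooth a 1-D spectrum; returns len(spectrum) - window_length + 1 points."""
    weights = savgol_weights(window_length, polyorder)
    # The smoothing weights are symmetric, so convolution equals correlation.
    return np.convolve(spectrum, weights, mode="valid")


# Synthetic example: a quadratic baseline plus a little noise.
x = np.linspace(0.0, 1.0, 60)
clean = 0.5 + 0.3 * x - 0.2 * x**2
noisy = clean + np.random.default_rng(0).normal(0.0, 0.01, size=x.size)

smoothed = savgol_smooth(noisy)
print(smoothed.shape)
```

A useful sanity check for any implementation is that a polynomial of degree at most `polyorder` passes through the filter unchanged.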

Using Python functions

from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create model
model = create_model("pls", n_components=10)

# Train model
model.fit(X_train, y_train)

# Evaluate model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print(f"R2 score: {metrics['r2']:.4f}")
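The `metrics` dictionary above exposes `r2`, `rmse`, and `mae`. As a reference for what those numbers mean, the following sketch computes them by hand with NumPy using the standard definitions. The function `regression_metrics` is a hypothetical stand-in written for this example; how `evaluate_regression_model` computes its values internally is not shown in this README.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """R², RMSE and MAE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    # ravel() also handles predictions returned as a column vector.
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    ss_res = np.sum(residuals**2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals**2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Made-up Brix values, chosen so the arithmetic is easy to follow.
metrics = regression_metrics([4.0, 5.0, 6.0], [4.5, 5.0, 5.5])
print(f"R2:   {metrics['r2']:.4f}")    # 0.7500
print(f"RMSE: {metrics['rmse']:.4f}")  # 0.4082
print(f"MAE:  {metrics['mae']:.4f}")   # 0.3333
```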

Running Experiments

The package provides multiple ways to run experiments:

1. Using Configuration Files (Recommended)

The simplest way to run experiments is using YAML configuration files:

# Run a single experiment from a config file
python experiments/run_from_config.py --config configs/pls_snv_savgol.yaml

# Run all experiments in the configs directory
python experiments/run_from_config.py --config_dir configs/

# Run a different experiment (Random Forest with MSC and feature selection)
python experiments/run_from_config.py --config configs/rf_msc_feature_selection.yaml

2. Creating Custom Configuration Files

You can create your own experiment configuration files using the create_config.py script:

# Create a new configuration file
python experiments/create_config.py --name my_experiment --data_path data/raw/Tomato_Viavi_Brix_model_pulp.csv --target_column Brix --model rf --transform snv --output configs/my_experiment.yaml

3. Programmatic Interface

You can also run experiments programmatically:

from experiments.experiment_manager import ExperimentManager
from nirs_tomato.config import ExperimentConfig

# Create experiment manager
manager = ExperimentManager()

# Run from config file
results = manager.run_from_config("configs/pls_snv_savgol.yaml")

# Or create a config object programmatically
config = ExperimentConfig.from_yaml("configs/pls_snv_savgol.yaml")
config.model.model_type = "rf"  # Change model type
config.data.transform = "msc"   # Change transformation

# Run with modified config
results = manager.run_from_config_object(config)
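The override pattern above (load a base config, change a couple of fields, run) extends naturally to small sweeps. The stdlib-only sketch below illustrates this with a stand-in config built from dataclasses. `SketchConfig`, `ModelSection`, and `DataSection` are hypothetical and only mirror the two attributes used above (`model.model_type` and `data.transform`); the real `ExperimentConfig` may be structured differently.

```python
from copy import deepcopy
from dataclasses import asdict, dataclass, field


@dataclass
class ModelSection:          # hypothetical stand-in for the "model" section
    model_type: str = "pls"


@dataclass
class DataSection:           # hypothetical stand-in for the "data" section
    transform: str = "snv"


@dataclass
class SketchConfig:          # hypothetical stand-in for ExperimentConfig
    name: str
    model: ModelSection = field(default_factory=ModelSection)
    data: DataSection = field(default_factory=DataSection)


base = SketchConfig(name="pls_snv")

# Copy the base config for every combination so the original stays untouched.
variants = []
for model_type in ("pls", "rf"):
    for transform in ("snv", "msc"):
        cfg = deepcopy(base)
        cfg.name = f"{model_type}_{transform}"
        cfg.model.model_type = model_type
        cfg.data.transform = transform
        variants.append(cfg)

print([v.name for v in variants])
print(asdict(base))
```

Copying before modifying matters because the nested sections are mutable: editing a shared config in place would silently change every later run.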

Data Visualization

Results can be visualized with standard matplotlib code. The example below plots predicted against actual values, reusing y_test and y_pred from the evaluation step above:

import matplotlib.pyplot as plt

# Plot regression results
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Brix')
plt.ylabel('Predicted Brix')
plt.title('Predicted vs Actual Brix Values')
plt.tight_layout()
plt.savefig("results/regression_plot.png")

📈 Experiment Tracking with MLflow

The package integrates with MLflow for experiment tracking, which helps you organize and compare your experiments.

Setup and Installation

MLflow is included in the project dependencies. To start the MLflow tracking server:

# Start MLflow server with local storage
python experiments/run_mlflow_server.py --host 0.0.0.0 --port 5000

# Start with custom backend store
python experiments/run_mlflow_server.py --backend-store-uri sqlite:///mlflow.db

Running Experiments with MLflow

To enable MLflow tracking in your experiments, set mlflow.enabled: true in your configuration file:

# In your YAML config file
mlflow:
  enabled: true
  experiment_name: nirs-tomato-brix

Or enable it programmatically:

from nirs_tomato.modeling.tracking import start_run, log_parameters, log_metrics, log_model, end_run

# Start a run
with start_run(experiment_name="nirs-tomato-test"):
    # Log parameters
    log_parameters({"model_type": "pls", "n_components": 10})
    
    # Train your model
    # ...
    
    # Log metrics
    log_metrics({"r2": 0.95, "rmse": 0.05})
    
    # Log model
    log_model(model, "pls_model")
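The `with start_run(...)` form above relies on a context manager, which guarantees that a run is closed even when training raises an error. How the package's wrapper is implemented is not shown in this README; the stdlib-only sketch below illustrates the general pattern with a hypothetical in-memory `tracked_run` that stands in for a real tracking backend.

```python
from contextlib import contextmanager

RUNS = []  # hypothetical in-memory store standing in for a tracking backend


@contextmanager
def tracked_run(experiment_name):
    """Open a run record and mark it FINISHED or FAILED on exit."""
    run = {"experiment": experiment_name, "params": {}, "metrics": {}, "status": "RUNNING"}
    RUNS.append(run)
    try:
        yield run
    except Exception:
        run["status"] = "FAILED"
        raise  # let the caller see the original error
    else:
        run["status"] = "FINISHED"


# A successful run: parameters and metrics are recorded, then the run closes.
with tracked_run("nirs-tomato-test") as run:
    run["params"].update({"model_type": "pls", "n_components": 10})
    run["metrics"].update({"r2": 0.95, "rmse": 0.05})

# A failing run: the status is still updated before the error propagates.
try:
    with tracked_run("nirs-tomato-test") as run:
        raise ValueError("training failed")
except ValueError:
    pass

print([r["status"] for r in RUNS])  # ['FINISHED', 'FAILED']
```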

Viewing Results via MLflow UI

Once your MLflow server is running, you can view your experiments in the MLflow UI:

  1. Open your browser and navigate to http://localhost:5000 (or your custom host:port)
  2. Browse experiments by name
  3. Compare runs, view metrics, and download models

🧪 Testing

Running Tests

Run the test suite using pytest:

# Run all tests
pytest

# Run with coverage report
pytest --cov=nirs_tomato

# Run specific test module
pytest tests/test_data_processing/test_transformers.py
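For contributors adding tests, pytest collects any function whose name starts with `test_` in files named `test_*.py`. The sketch below shows the shape such a test module might take, using the "filtering of non-numeric columns" feature as the subject. The helper `numeric_columns` and the sample rows are hypothetical and written only for this example; they are not the repository's actual code or tests.

```python
def is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def numeric_columns(rows):
    """Names of columns whose every value parses as a float.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    if not rows:
        return []
    return [name for name in rows[0] if all(is_number(row[name]) for row in rows)]


def test_drops_text_columns():
    rows = [
        {"Notes": "ripe", "908.1": "0.412", "Brix": "4.8"},
        {"Notes": "", "908.1": "0.398", "Brix": "5.1"},
    ]
    assert numeric_columns(rows) == ["908.1", "Brix"]


def test_empty_input():
    assert numeric_columns([]) == []
```

Keeping each test focused on a single behaviour makes failures in the coverage report easier to trace back to a specific module.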

Continuous Integration

This project uses GitHub Actions for continuous integration. The CI pipeline:

  1. Runs tests on multiple Python versions
  2. Ensures code quality with ruff and other linters
  3. Generates coverage reports

📝 License

This project is licensed under the MIT License.


Built for tomato quality analysis using NIR spectroscopy
