🍅 NIRS - NIR Spectroscopy Analysis for Tomatoes

Analysis of tomato quality using near-infrared (NIR) spectroscopy

📑 Table of Contents

✨ Features
📂 Project Structure
🚀 Installation
📊 Usage
📈 Experiment Tracking with MLflow
🧪 Testing
- Running Tests
- Continuous Integration
📝 License

✨ Features

🔍 Pre-processing of NIR spectral data:
- Spectral transformations (SNV, MSC)
- Savitzky-Golay filtering
- Automatic detection and filtering of non-numeric columns
- Outlier detection and removal
🧠 Modeling of NIR data:
- PLS regression
- Support Vector Regression (SVR)
- Random Forest regression
- XGBoost regression
- LightGBM regression
📊 Advanced model optimization:
- Hyperparameter tuning with Optuna
- Feature selection methods (Genetic Algorithm, CARS, VIP)
- Integrated cross-validation
📈 Experiment tracking with MLflow:
- Parameter logging
- Metrics tracking
- Model artifacts storage
- Feature importance visualization
🧪 Quality assurance:
- Comprehensive test suite
- Code quality checks with ruff and black

📂 Project Structure

NIRS/
├── configs/                    # Configuration files for experiments
│   ├── pls_snv_savgol.yaml     # PLS model with SNV and Savitzky-Golay
│   ├── rf_msc_feature_selection.yaml # Random Forest with feature selection
│   ├── xgb_genetic_algorithm.yaml    # XGBoost with genetic algorithm
│   ├── rf_hyperparams_tuning.yaml    # Random Forest with hyperparameter tuning
│   └── README.md               # Documentation for config files
├── data/                       # Data directory
│   ├── raw/                    # Raw input data files
│   │   └── Tomato_Viavi_Brix_model_pulp.csv # Example tomato NIR spectra dataset
│   └── processed/              # Processed data files
├── experiments/                # Experiment scripts
│   ├── analyze_models.py       # Script for analyzing model performance
│   ├── create_config.py        # Tool for creating experiment configs
│   ├── experiment_manager.py   # Manages experiment execution
│   ├── process_data.py         # Data processing utilities
│   ├── process_tomato_data.py  # Tomato-specific data processing
│   ├── run_experiment.py       # Main experiment runner
│   ├── run_from_config.py      # Run experiments from config files
│   ├── run_experiments.py      # Extended experiment runner
│   ├── run_mlflow_server.py    # MLflow server launcher
│   └── train_model.py          # Model training script
├── images/                     # Images for documentation
│   └── tomato.png              # Visualization of tomato NIR analysis
├── mlruns/                     # MLflow experiment tracking data
├── models/                     # Saved model files
├── nirs_tomato/                # Main package
│   ├── config.py               # Configuration utilities
│   ├── data_processing/        # Data processing modules
│   │   ├── constants.py        # Constant definitions
│   │   ├── feature_selection.py # Feature selection methods
│   │   ├── pipeline/           # Pipeline implementations
│   │   ├── transformers.py     # Spectral transformers
│   │   └── utils.py            # Utility functions
│   ├── modeling/               # Modeling and evaluation modules
│   │   ├── evaluation.py       # Model evaluation tools
│   │   ├── hyperparameter_tuning.py # Hyperparameter optimization
│   │   ├── model_factory.py    # Model creation factory
│   │   ├── regression_models.py # Regression model implementations
│   │   └── tracking.py         # MLflow experiment tracking
│   └── __init__.py             # Package initialization
├── results/                    # Experiment results and outputs
├── tests/                      # Test files
│   ├── test_data_processing/   # Tests for data processing
│   └── test_modeling/          # Tests for modeling
├── .gitignore                  # Git ignore file
├── .flake8                     # Flake8 configuration
├── pyproject.toml              # Project configuration
└── README.md                   # This file

🚀 Installation

Clone this repository and install the package using pip:

git clone https://github.com/yourusername/NIRS.git
cd NIRS
pip install -e ".[dev]"

📊 Usage

Quick Start

Here's a quick example to get you started with analyzing tomato NIR spectra:

from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load your NIR data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')

# 2. Process the spectral data
results = preprocess_spectra(
    df=df,
    target_column='Brix',
    transformers=[SNVTransformer()],
    exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
    remove_outliers=False,
    verbose=True
)

# 3. Get processed features and target
X, y = results['X'], results['y']

# 4. Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Create and train a model
model = create_model("pls", n_components=10)
model.fit(X_train, y_train)

# 6. Evaluate the model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print("Model performance:")
print(f"  R² score: {metrics['r2']:.4f}")
print(f"  RMSE: {metrics['rmse']:.4f}")
print(f"  MAE: {metrics['mae']:.4f}")
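
The three metrics printed above have standard textbook definitions. The sketch below computes them in plain Python so the numbers are easier to interpret. It assumes evaluate_regression_model follows the usual formulas, which has not been checked against the package source; the function name regression_metrics is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE using their textbook definitions.

    Hypothetical helper for illustration; not part of nirs_tomato.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares (model error) and total sum of squares (spread of the target)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Toy Brix values, invented for the example
example = regression_metrics([4.0, 5.0, 6.0], [4.5, 5.0, 5.5])
print(example)
```

RMSE and MAE are in the units of the target (here Brix), while R² is unitless and compares the model against always predicting the mean.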

Data Processing

The package provides tools for data processing, including transformers for spectral data, utilities for data cleaning, and pipelines for complete data processing workflows.

from nirs_tomato.data_processing.transformers import SNVTransformer
from nirs_tomato.data_processing.pipeline.data_processing import preprocess_spectra
import pandas as pd

# Load data
df = pd.read_csv('data/raw/Tomato_Viavi_Brix_model_pulp.csv')

# Process data
results = preprocess_spectra(
    df=df,
    target_column='Brix',
    transformers=[SNVTransformer()],
    exclude_columns=['Instrument Serial Number', 'Notes', 'Timestamp'],
    remove_outliers=False,
    verbose=True
)

# Get processed features and target
X = results['X']
y = results['y']
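
For readers unfamiliar with the spectral transformations, here is a minimal NumPy sketch of SNV (used above) and MSC (listed under Features). It shows the standard definitions only; the package's SNVTransformer and its MSC counterpart may differ in details such as the degrees of freedom used for the standard deviation or the choice of reference spectrum. The function names snv and msc are hypothetical.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    # ddof=1 (sample standard deviation) is a common choice; assumed here
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    If no reference is given, the mean spectrum of the data set is used.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ≈ intercept + slope * reference, then undo both effects
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


# Toy data: the second spectrum is a scaled and shifted copy of the first
base = np.array([0.10, 0.25, 0.40, 0.30, 0.15])
toy_spectra = np.vstack([base, 2.0 * base + 0.5])
snv_out = snv(toy_spectra)
msc_out = msc(toy_spectra)
```

Because the two toy spectra differ only by a multiplicative and an additive effect, both transforms map them onto the same curve, which is the scatter-removal behaviour these methods are meant to provide.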

Model Training

The package provides command-line scripts for model training, as well as Python functions for creating and evaluating models.

Using the command-line script

# Train a PLS model with SNV transformation
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model pls --transform snv

# Train an XGBoost model with MSC transformation and Savitzky-Golay filtering
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model xgb --transform msc --savgol --window_length 15 --polyorder 2 --tune_hyperparams

# Train a Random Forest model with feature selection
python experiments/train_model.py --data data/raw/Tomato_Viavi_Brix_model_pulp.csv --target Brix --model rf --transform snv --feature_selection vip --n_features 20
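
The flags used in the commands above can be summarized as an argparse parser. This is only a sketch built from the flags visible in these examples: the real experiments/train_model.py may define more options, and every default and help string below is a placeholder assumption.

```python
import argparse


def build_parser():
    """Illustrative parser covering only the flags shown in the examples above."""
    parser = argparse.ArgumentParser(description="Train a NIR regression model (sketch)")
    parser.add_argument("--data", required=True, help="Path to the CSV file with spectra")
    parser.add_argument("--target", required=True, help="Target column, e.g. Brix")
    # Defaults below are placeholders, not the script's real defaults
    parser.add_argument("--model", default="pls", help="Model type, e.g. pls, xgb, rf")
    parser.add_argument("--transform", default=None, help="Spectral transform, e.g. snv or msc")
    parser.add_argument("--savgol", action="store_true", help="Apply Savitzky-Golay filtering")
    parser.add_argument("--window_length", type=int, default=15, help="Savitzky-Golay window")
    parser.add_argument("--polyorder", type=int, default=2, help="Savitzky-Golay polynomial order")
    parser.add_argument("--tune_hyperparams", action="store_true", help="Run hyperparameter tuning")
    parser.add_argument("--feature_selection", default=None, help="Feature selection method, e.g. vip")
    parser.add_argument("--n_features", type=int, default=None, help="Number of features to keep")
    return parser


# Parse the arguments of the XGBoost example command
args = build_parser().parse_args([
    "--data", "data/raw/Tomato_Viavi_Brix_model_pulp.csv",
    "--target", "Brix",
    "--model", "xgb",
    "--transform", "msc",
    "--savgol",
    "--window_length", "15",
    "--polyorder", "2",
    "--tune_hyperparams",
])
print(args.model, args.transform, args.savgol, args.window_length)
```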

Using Python functions

from nirs_tomato.modeling.model_factory import create_model
from nirs_tomato.modeling.evaluation import evaluate_regression_model
from sklearn.model_selection import train_test_split

# Split data (X and y come from the Data Processing step above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create model
model = create_model("pls", n_components=10)

# Train model
model.fit(X_train, y_train)

# Evaluate model
metrics, y_pred = evaluate_regression_model(model, X_test, y_test)
print(f"R2 score: {metrics['r2']:.4f}")

Running Experiments

The package provides multiple ways to run experiments:

1. Using Configuration Files (Recommended)

The simplest way to run experiments is using YAML configuration files:

# Run a single experiment from a config file
python experiments/run_from_config.py --config configs/pls_snv_savgol.yaml

# Run all experiments in the configs directory
python experiments/run_from_config.py --config_dir configs/

# Run the Random Forest experiment with MSC and feature selection
python experiments/run_from_config.py --config configs/rf_msc_feature_selection.yaml
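
Running a whole directory presumably means collecting the YAML files in it and running each one in turn. A minimal sketch of that collection step, assuming non-YAML files such as configs/README.md are skipped; the helper name find_configs is hypothetical and the actual logic in run_from_config.py may differ.

```python
from pathlib import Path
import tempfile


def find_configs(config_dir):
    """Return the YAML config files in config_dir, sorted by name."""
    directory = Path(config_dir)
    return sorted(p for p in directory.iterdir() if p.suffix in {".yaml", ".yml"})


# Demonstrate on a temporary directory that mimics configs/
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rf_msc_feature_selection.yaml", "pls_snv_savgol.yaml", "README.md"):
        (Path(tmp) / name).write_text("")
    found = [p.name for p in find_configs(tmp)]

print(found)
```

Sorting the paths keeps the execution order stable between runs, which makes logs easier to compare.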

2. Creating Custom Configuration Files

You can create your own experiment configuration files using the create_config.py script:

# Create a new configuration file
python experiments/create_config.py --name my_experiment --data_path data/raw/Tomato_Viavi_Brix_model_pulp.csv --target_column Brix --model rf --transform snv --output configs/my_experiment.yaml

3. Programmatic Interface

You can also run experiments programmatically:

from experiments.experiment_manager import ExperimentManager
from nirs_tomato.config import ExperimentConfig

# Create experiment manager
manager = ExperimentManager()

# Run from config file
results = manager.run_from_config("configs/pls_snv_savgol.yaml")

# Or create a config object programmatically
config = ExperimentConfig.from_yaml("configs/pls_snv_savgol.yaml")
config.model.model_type = "rf"  # Change model type
config.data.transform = "msc"   # Change transformation

# Run with modified config
results = manager.run_from_config_object(config)
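
The attribute access above (config.model.model_type, config.data.transform) suggests a nested configuration object. The dataclass sketch below mirrors that shape to show how such an object can be modified safely. All class names, fields beyond those used in this README, and default values are assumptions; the real ExperimentConfig in nirs_tomato/config.py may be built differently.

```python
from dataclasses import dataclass, field


@dataclass
class DataConfig:
    transform: str = "snv"  # placeholder default


@dataclass
class ModelConfig:
    model_type: str = "pls"  # placeholder default


@dataclass
class MLflowConfig:
    enabled: bool = False
    experiment_name: str = "nirs-tomato-brix"


@dataclass
class SketchExperimentConfig:
    """Hypothetical stand-in for ExperimentConfig, for illustration only."""
    # default_factory gives every config its own nested objects
    data: DataConfig = field(default_factory=DataConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    mlflow: MLflowConfig = field(default_factory=MLflowConfig)


config = SketchExperimentConfig()
config.model.model_type = "rf"  # Change model type
config.data.transform = "msc"   # Change transformation

# A fresh config is unaffected by the edits above
fresh = SketchExperimentConfig()
print(config.model.model_type, fresh.model.model_type)
```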

Data Visualization

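The package includes functions for visualizing results. A predicted-vs-actual plot can also be drawn directly with matplotlib, reusing y_test and y_pred from the model evaluation step:

import matplotlib.pyplot as plt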

# Plot regression results
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r--')
plt.xlabel('Actual Brix')
plt.ylabel('Predicted Brix')
plt.title('Predicted vs Actual Brix Values')
plt.tight_layout()
plt.savefig("results/regression_plot.png")

📈 Experiment Tracking with MLflow

The package integrates with MLflow for experiment tracking, which helps you organize and compare your experiments.

Setup and Installation

MLflow is included in the project dependencies. To start the MLflow tracking server:

# Start MLflow server with local storage
python experiments/run_mlflow_server.py --host 0.0.0.0 --port 5000

# Start with custom backend store
python experiments/run_mlflow_server.py --backend-store-uri sqlite:///mlflow.db

Running Experiments with MLflow

To enable MLflow tracking in your experiments, set mlflow.enabled: true in your configuration file:

# In your YAML config file
mlflow:
  enabled: true
  experiment_name: nirs-tomato-brix

Or enable it programmatically:

from nirs_tomato.modeling.tracking import start_run, log_parameters, log_metrics, log_model

# Start a run
with start_run(experiment_name="nirs-tomato-test"):
    # Log parameters
    log_parameters({"model_type": "pls", "n_components": 10})
    
    # Train your model
    # ...
    
    # Log metrics
    log_metrics({"r2": 0.95, "rmse": 0.05})
    
    # Log model
    log_model(model, "pls_model")
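
Using start_run in a with statement means the run is closed automatically, even if training raises an exception. The toy sketch below illustrates that pattern with an in-memory list instead of MLflow. It is not the package's implementation: toy_start_run and RUNS are hypothetical names, and the real start_run wraps MLflow in ways not shown here.

```python
from contextlib import contextmanager

RUNS = []  # in-memory stand-in for a tracking backend


@contextmanager
def toy_start_run(experiment_name):
    """Open a run, hand it to the caller, and always record how it ended."""
    run = {"experiment": experiment_name, "params": {}, "metrics": {}, "status": "RUNNING"}
    RUNS.append(run)
    try:
        yield run
        run["status"] = "FINISHED"
    except Exception:
        # Mark the run as failed, then let the error propagate to the caller
        run["status"] = "FAILED"
        raise


with toy_start_run("nirs-tomato-test") as run:
    run["params"].update({"model_type": "pls", "n_components": 10})
    run["metrics"].update({"r2": 0.95, "rmse": 0.05})

print(RUNS[-1]["status"])
```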

Viewing Results via MLflow UI

Once your MLflow server is running, you can view your experiments in the MLflow UI:

- Open your browser and navigate to http://localhost:5000 (or your custom host:port)
- Browse experiments by name
- Compare runs, view metrics, and download models

🧪 Testing

Running Tests

Run the test suite using pytest:

# Run all tests
pytest

# Run with coverage report
pytest --cov=nirs_tomato

# Run specific test module
pytest tests/test_data_processing/test_transformers.py

Continuous Integration

This project uses GitHub Actions for continuous integration. The CI pipeline:

- Runs tests on multiple Python versions
- Ensures code quality with ruff and other linters
- Generates coverage reports

📝 License

This project is licensed under the MIT License.

Built for tomato quality analysis using NIR spectroscopy
