Gaia Miner Setup Guide

The definitive guide for setting up and running a Gaia miner that participates in Geomagnetic, Soil Moisture, and Weather forecasting tasks.

Quick Start
Weather Task Setup
Complete Configuration Reference
Critical Variable Names
Task Descriptions
Hardware Requirements
Troubleshooting

Quick Start

1. Environment Configuration

Create a .env file in your miner directory:

# --- Basic Miner Configuration ---
WALLET_NAME=<YOUR_WALLET_NAME>
HOTKEY_NAME=<YOUR_HOTKEY_NAME>
NETUID=<NETUID>  # 57 for mainnet, 237 for testnet
SUBTENSOR_NETWORK=<NETWORK>  # finney or test
MIN_STAKE_THRESHOLD=<STAKE>  # 10000 for mainnet, 0 for testnet

# --- Database Configuration ---
DB_USER=postgres
DB_PASSWORD=postgres  # Replace with a strong password outside local testing
DB_HOST=localhost
DB_PORT=5432
DB_NAME=miner_db
DB_TARGET=miner
DB_CONNECTION_TYPE=socket
ALEMBIC_AUTO_UPGRADE=True

# --- Network Configuration ---
PUBLIC_PORT=33333  # Port posted to the chain
PORT=33334         # Internal port the miner listens on
EXTERNAL_IP="your_external_ip_address"

# --- General Settings ---
MINER_LOGGING_LEVEL=INFO
ENV=prod
MINER_JWT_SECRET_KEY=<GENERATE_WITH_SCRIPT>  # See generation instructions below

2. Generate JWT Secret Key

cd /root/Gaia  # or your project root
python gaia/miner/utils/generate_jwt_secret.py
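
The contents of generate_jwt_secret.py are not reproduced in this guide. As a rough sketch of what a generator of this kind usually does (an assumption, not the project's script), a URL-safe random secret can be produced with Python's standard secrets module. The helper name make_jwt_secret is hypothetical:

```python
import secrets


def make_jwt_secret(num_bytes: int = 64) -> str:
    """Return a URL-safe random string built from num_bytes of entropy.

    Hypothetical helper for illustration; the real script may differ.
    """
    if num_bytes < 32:
        raise ValueError("use at least 32 bytes of entropy for a signing secret")
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into .env
    print(f"MINER_JWT_SECRET_KEY={make_jwt_secret()}")
```

Whichever way the value is generated, keep it out of version control and restart the miner after changing it.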

3. Run the Miner

cd gaia/miner
python miner.py

# Or with PM2 (run from the project root, since the script path is relative to it):
pm2 start --name miner --instances 1 python -- gaia/miner/miner.py

Weather Task Setup

The Weather Task is opt-in due to high computational requirements and is disabled by default.

Basic Weather Configuration

# Enable weather task
WEATHER_MINER_ENABLED=True

# Storage directories
MINER_FORECAST_DIR=/root/Gaia/miner_forecasts_background
MINER_GFS_ANALYSIS_CACHE_DIR="./gfs_analysis_cache_miner"

# File serving mode
WEATHER_FILE_SERVING_MODE=local  # or "r2_proxy"

Inference Options

Option 1: HTTP Inference Service (Recommended)

Best for most users, because inference runs on remote GPU infrastructure rather than on the miner host:

WEATHER_INFERENCE_TYPE=http_service
WEATHER_INFERENCE_SERVICE_URL="http://localhost:8000/run_inference"

# ⚠️ CRITICAL: R2 Storage - Use these EXACT variable names ⚠️
R2_ENDPOINT_URL=https://<ACCOUNT_ID>.r2.cloudflarestorage.com
R2_BUCKET=<YOUR_R2_BUCKET_NAME>
R2_ACCESS_KEY=<YOUR_R2_ACCESS_KEY_ID>
R2_SECRET_ACCESS_KEY=<YOUR_R2_SECRET_ACCESS_KEY>

# Inference Service API Key
INFERENCE_SERVICE_API_KEY=<YOUR_API_KEY>

Option 2: Local Inference (Requires GPU)

For users with powerful local hardware:

WEATHER_INFERENCE_TYPE=local

# Hardware Requirements:
# - NVIDIA GPU with 24GB+ VRAM (RTX 3090, RTX 4090, A5000+)
# - 32GB+ system RAM
# - 500GB+ free storage

Option 3: Azure Foundry Inference

For cloud-based inference:

WEATHER_INFERENCE_TYPE=azure_foundry
FOUNDRY_ENDPOINT_URL=<YOUR_AZURE_ENDPOINT>
FOUNDRY_ACCESS_TOKEN=<YOUR_AZURE_TOKEN>
BLOB_URL_WITH_RW_SAS=<YOUR_AZURE_BLOB_SAS_URL>

Weather File Serving Modes

Local Storage Mode (Default)

WEATHER_FILE_SERVING_MODE=local

Downloads forecast files from R2 to local storage
Serves files directly via HTTP/zarr
Pros: Faster validator access, original zarr design
Cons: Requires more local storage space

R2 Proxy Mode

WEATHER_FILE_SERVING_MODE=r2_proxy

Streams files from R2 on-demand without local storage
Miner acts as a proxy between validators and R2
Pros: Minimal storage requirements, R2 credentials stay private
Cons: Higher network usage, slight latency for validator requests
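
The two serving modes above are selected by a single variable. A minimal sketch of reading and validating it is shown here; how the miner itself parses the value is not documented in this guide, so the function name resolve_serving_mode and the strictness of the check are assumptions:

```python
import os

VALID_SERVING_MODES = {"local", "r2_proxy"}


def resolve_serving_mode(env=None) -> str:
    """Read WEATHER_FILE_SERVING_MODE, falling back to the documented default "local"."""
    env = os.environ if env is None else env
    mode = env.get("WEATHER_FILE_SERVING_MODE", "local").strip().strip('"').lower()
    if mode not in VALID_SERVING_MODES:
        raise ValueError(f"unsupported WEATHER_FILE_SERVING_MODE: {mode!r}")
    return mode
```

Failing early on an unknown value makes a typo such as r2-proxy visible at startup instead of silently falling back to local storage.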

Complete Configuration Reference

See miner_template.env for a complete template with all possible configuration options.

Required Variables

# Basic identification
WALLET_NAME=<YOUR_WALLET_NAME>
HOTKEY_NAME=<YOUR_HOTKEY_NAME>
NETUID=<NETUID>
SUBTENSOR_NETWORK=<NETWORK>

# Essential security
MINER_JWT_SECRET_KEY=<GENERATED_SECRET>

# Network ports
PUBLIC_PORT=33333
PORT=33334
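
Before starting the miner it can help to confirm that every required variable is present and no longer a template placeholder. The sketch below parses .env text with a deliberately simple parser (an assumption: it handles KEY=VALUE lines, quotes, and trailing " # comment" text only). The names parse_env and missing_required are hypothetical:

```python
REQUIRED_VARS = (
    "WALLET_NAME", "HOTKEY_NAME", "NETUID", "SUBTENSOR_NETWORK",
    "MINER_JWT_SECRET_KEY", "PUBLIC_PORT", "PORT",
)


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks, comment lines, and inline comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.split(" #", 1)[0].strip().strip('"')
        values[key.strip()] = value
    return values


def missing_required(text: str) -> list:
    """Return required names that are absent, empty, or still <PLACEHOLDER> values."""
    env = parse_env(text)
    return [
        name for name in REQUIRED_VARS
        if not env.get(name) or env[name].startswith("<")
    ]
```

Usage: missing_required(open(".env").read()) returns an empty list when the file is ready.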

Database Configuration

DB_USER=postgres
DB_PASSWORD=postgres
DB_HOST=localhost
DB_PORT=5432
DB_NAME=miner_db
DB_TARGET=miner
DB_CONNECTION_TYPE=socket  # or "tcp"
ALEMBIC_AUTO_UPGRADE=True

Weather Task Variables (Optional)

# Core weather configuration
WEATHER_MINER_ENABLED=False  # Set to True to enable
WEATHER_INFERENCE_TYPE=http_service
MINER_FORECAST_DIR=/root/Gaia/miner_forecasts_background
WEATHER_FILE_SERVING_MODE=local

# R2 Storage (for HTTP inference)
R2_ENDPOINT_URL=https://<ACCOUNT_ID>.r2.cloudflarestorage.com
R2_BUCKET=<BUCKET_NAME>
R2_ACCESS_KEY=<ACCESS_KEY>
R2_SECRET_ACCESS_KEY=<SECRET_KEY>
INFERENCE_SERVICE_API_KEY=<API_KEY>

# HTTP Service URL
WEATHER_INFERENCE_SERVICE_URL="http://localhost:8000/run_inference"

Critical Variable Names

⚠️ MUST Use Exact Names

The following variable names must match exactly what the code expects:

R2 Storage:

✅ R2_BUCKET (NOT R2_BUCKET_NAME)
✅ R2_ACCESS_KEY (NOT R2_ACCESS_KEY_ID)
✅ R2_SECRET_ACCESS_KEY (correct)
✅ R2_ENDPOINT_URL (correct)

Port Configuration:

✅ PORT=33334 (NOT INTERNAL_PORT)
✅ PUBLIC_PORT=33333 (correct)

API Keys:

✅ INFERENCE_SERVICE_API_KEY (primary)
✅ WEATHER_RUNPOD_API_KEY (fallback)

Common Variable Name Mistakes

❌ WRONG:

R2_BUCKET_NAME=my-bucket      # Wrong!
R2_ACCESS_KEY_ID=my-key       # Wrong!
INTERNAL_PORT=33334           # Wrong!

✅ CORRECT:

R2_BUCKET=my-bucket           # Correct
R2_ACCESS_KEY=my-key          # Correct  
PORT=33334                    # Correct

Verification Commands

Check your configuration:

# Verify R2 variable names
grep -E "R2_BUCKET|R2_ACCESS_KEY" .env
# Should show R2_BUCKET= and R2_ACCESS_KEY= (without suffixes)

# Check for deprecated variables
grep -E "INTERNAL_PORT|R2_BUCKET_NAME|R2_ACCESS_KEY_ID" .env
# Should return no matches (these are wrong)

# Verify port configuration (anchor the pattern so PUBLIC_PORT and DB_PORT are not matched)
grep -E "^PORT=" .env
# Should show PORT=33334
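
The same check can be scripted. This sketch reports each assignment that uses one of the wrong names listed in this section, together with its replacement. The function name find_deprecated is hypothetical, and commented-out lines are deliberately ignored:

```python
import re

# Wrong name -> name this guide says the code expects.
RENAMED = {
    "R2_BUCKET_NAME": "R2_BUCKET",
    "R2_ACCESS_KEY_ID": "R2_ACCESS_KEY",
    "INTERNAL_PORT": "PORT",
}


def find_deprecated(env_text: str) -> list:
    """Return (line_number, old_name, new_name) for each deprecated assignment."""
    hits = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        match = re.match(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*=", line)
        if match and match.group(1) in RENAMED:
            hits.append((number, match.group(1), RENAMED[match.group(1)]))
    return hits
```

Matching on the full variable name avoids the false positive a plain substring search gives, where R2_BUCKET also matches R2_BUCKET_NAME.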

Task Descriptions

Geomagnetic Task

Purpose: Forecast the DST (Disturbance Storm Time) index to predict geomagnetic disturbances affecting GPS, communications, and power grids.

Data Sources:

Hourly DST index values from validators
Optional historical DST data for model improvement

Output:

Predicted DST value for the next hour
UTC timestamp of the last observation

Process:

Receive cleaned DataFrame with timestamp and DST values
Process historical data if available
Generate prediction using GeomagneticPreprocessing
Return prediction and timestamp
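
To make the input and output shape concrete, here is a deliberately naive baseline that averages the last three hourly values. It is not the GeomagneticPreprocessing logic, which this guide does not show. The function name, the tuple-based input, and the three-hour window are all assumptions for illustration:

```python
from datetime import datetime, timezone


def moving_average_forecast(observations):
    """Predict next-hour DST as the mean of the last three hourly values.

    observations: list of (timezone-aware datetime, DST value) in any order.
    Returns (predicted_dst, utc_timestamp_of_last_observation).
    """
    if not observations:
        raise ValueError("need at least one observation")
    ordered = sorted(observations, key=lambda item: item[0])
    recent = [value for _, value in ordered[-3:]]
    last_timestamp = ordered[-1][0].astimezone(timezone.utc)
    return sum(recent) / len(recent), last_timestamp


if __name__ == "__main__":
    start = datetime(2024, 5, 10, 0, tzinfo=timezone.utc)
    history = [(start.replace(hour=h), v) for h, v in enumerate([-10.0, -20.0, -30.0, -40.0])]
    print(moving_average_forecast(history))
```

A baseline like this is mainly useful as a sanity check for a real model: the output pair mirrors the "prediction and timestamp" described above.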

Soil Moisture Task

Purpose: Predict soil moisture levels using satellite imagery and weather data to support agriculture and environmental monitoring.

Data Sources:

Sentinel-2 satellite imagery
IFS weather forecast data
SMAP soil moisture data (for scoring)
SRTM elevation data
NDVI vegetation indices

Process:

Region Selection: Choose analysis regions avoiding urban/water areas
Data Retrieval: Gather multi-source datasets via APIs
Data Compilation: Create .tiff files with band order: [Sentinel-2, IFS, SRTM, NDVI]
Model Inference: Process through soil_model.py
Validation: Compare predictions against ground truth SMAP data

IFS Weather Variables (in order):

t2m: Surface air temperature (2m) [Kelvin]
tp: Total precipitation [m/day]
ssrd: Surface solar radiation downwards [J/m²]
st: Surface soil temperature [Kelvin]
stl2/stl3: Soil temperature at soil levels 2 and 3 [Kelvin]
sp: Surface pressure [Pascals]
d2m: Dewpoint temperature [Kelvin]
u10/v10: Wind components at 10m [m/s]
ro: Total runoff [m/day]
msl: Mean sea level pressure [Pascals]
et0: Reference evapotranspiration [mm/day]
bare_soil_evap: Bare soil evaporation [mm/day]
svp: Saturated vapor pressure [kPa]
avp: Actual vapor pressure [kPa]
r_n: Net radiation [MJ/m²/day]
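
The svp and avp bands are derived quantities. One common way to obtain them from t2m and d2m is the FAO-56 saturation vapor pressure formula, sketched below. Whether the Gaia pipeline uses exactly this formula is an assumption; the function names are hypothetical:

```python
import math


def saturation_vapor_pressure_kpa(temp_kelvin: float) -> float:
    """FAO-56 saturation vapor pressure in kPa for a temperature in Kelvin."""
    temp_c = temp_kelvin - 273.15
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vapor_pressures_kpa(t2m_kelvin: float, d2m_kelvin: float):
    """Return (svp, avp): svp from air temperature, avp from dewpoint temperature."""
    return (
        saturation_vapor_pressure_kpa(t2m_kelvin),
        saturation_vapor_pressure_kpa(d2m_kelvin),
    )
```

Note the unit handling: the IFS temperatures are in Kelvin, while the formula expects degrees Celsius, so the conversion happens inside the helper.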

Weather Task

Purpose: Generate detailed weather forecasts using the Microsoft Aurora model for meteorological prediction.

Key Features:

40-step forecasts at 6-hour intervals (10-day forecasts)
Zarr-based output format for efficient data access
Multiple inference backends (local, HTTP service, Azure Foundry)
Configurable file serving (local storage vs R2 proxy)
Comprehensive verification and scoring systems
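
The arithmetic behind "40 steps at 6-hour intervals" can be checked directly: 40 × 6 h = 240 h = 10 days. A short sketch that lists the valid time of every step follows. The function name is hypothetical, and it assumes the first step is 6 hours after initialization:

```python
from datetime import datetime, timedelta, timezone

FORECAST_STEPS = 40
STEP_HOURS = 6


def forecast_valid_times(init_time: datetime):
    """Return the valid time of each of the 40 six-hourly forecast steps."""
    return [
        init_time + timedelta(hours=STEP_HOURS * step)
        for step in range(1, FORECAST_STEPS + 1)
    ]


if __name__ == "__main__":
    init = datetime(2024, 1, 1, 0, tzinfo=timezone.utc)
    times = forecast_valid_times(init)
    print(times[0].isoformat(), "to", times[-1].isoformat())
```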

Workflow:

Data Reception: Receive GFS initialization data from validators
Data Processing: Convert GFS data into Aurora-compatible format
Inference: Run multi-step forecast generation (local or remote)
Output Generation: Create Zarr stores with forecast data
File Serving: Serve data to validators via HTTP/zarr or R2 proxy
Verification: Enable validator verification and scoring

Architecture Comparison:

Local Storage Mode:

RunPod → R2 Upload → Download to Miner → Serve Local Zarr → Validator

R2 Proxy Mode:

RunPod → R2 Upload → Miner Proxy → Validator
                        ↑
                   (Streams from R2)

Hardware Requirements

Basic Miner (Geomagnetic + Soil Moisture)

CPU: 4+ cores
RAM: 8GB minimum
Storage: 50GB+ for databases and caching
Network: Stable broadband internet

Weather Task Local Inference

GPU: NVIDIA with 24GB+ VRAM
- Recommended: RTX 3090, RTX 4090, A5000, A6000, H100
CPU: 8+ cores (16+ recommended)
RAM: 32GB+ system memory (64GB recommended)
Storage: 500GB+ fast storage (NVMe SSD preferred)
- GFS cache: ~50GB
- Forecast outputs: ~100GB (local mode)
- Model weights: ~10GB
Network: High-speed internet for GFS downloads (multi-GB files)

Weather Task HTTP Service

CPU: 4+ cores
RAM: 16GB+
Storage: 100GB+ (for local mode) or 20GB+ (for R2 proxy mode)
Network: Stable high-speed internet
External GPU: Access to RunPod or similar GPU service

Troubleshooting

Common Issues

Weather Task Not Starting

Symptoms:

Weather task DISABLED for this miner

Solutions:

Set WEATHER_MINER_ENABLED=True
Generate JWT secret: python gaia/miner/utils/generate_jwt_secret.py
Ensure forecast directory exists and is writable
Restart miner after configuration changes
For local inference: verify GPU with nvidia-smi

R2 Connection Errors

Symptoms:

R2 client configuration is incomplete
R2 connection failed

Solutions:

Verify exact variable names: R2_BUCKET, R2_ACCESS_KEY
Check R2 credentials and permissions
Ensure endpoint URL includes account ID
Test R2 connectivity independently
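
A frequent cause of the errors above is an endpoint URL that is missing the account ID. The sketch below extracts the account ID from a URL of the form used in this guide (https://<ACCOUNT_ID>.r2.cloudflarestorage.com). The function name is hypothetical, and the pattern is an assumption that covers only this default endpoint form:

```python
import re
from urllib.parse import urlparse

_R2_HOST = re.compile(r"^([0-9a-z]+)\.r2\.cloudflarestorage\.com$")


def r2_account_id(endpoint_url: str):
    """Return the account ID embedded in an R2 endpoint URL, or None if absent."""
    parsed = urlparse(endpoint_url.strip().strip('"'))
    if parsed.scheme != "https" or not parsed.hostname:
        return None
    match = _R2_HOST.match(parsed.hostname)
    return match.group(1) if match else None
```

A None result means the value of R2_ENDPOINT_URL should be rechecked before looking at credentials or permissions.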

API Key Issues

Symptoms:

No RunPod API Key found
Authentication failed

Solutions:

Use INFERENCE_SERVICE_API_KEY (primary)
Or WEATHER_RUNPOD_API_KEY (fallback)
Ensure API key matches inference service configuration
Check for typos or extra spaces

Database Connection Issues

Symptoms:

Database connection failed
psycopg2.OperationalError

Solutions:

Verify PostgreSQL is running: sudo systemctl status postgresql
Check database credentials in .env
Ensure database exists: createdb miner_db
Test connection manually

Out of Memory Errors (Local Weather)

Symptoms:

CUDA out of memory
RuntimeError: CUDA error

Solutions:

Ensure 24GB+ GPU VRAM available
Check no other processes using GPU: nvidia-smi
Monitor system RAM usage
Consider switching to HTTP service or Azure Foundry
Reduce batch size if using custom configurations

Port Connection Issues

Symptoms:

Connection refused
Port already in use

Solutions:

Use PORT=33334 (not INTERNAL_PORT)
Check port availability: netstat -tlnp | grep 33334
Ensure nginx forwards correctly to internal port
Check firewall settings
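
Where netstat is not installed, the same question ("is anything listening on this port?") can be answered with a short standard-library probe. This is a sketch with a hypothetical function name; it only tests TCP reachability on the given host and says nothing about firewall rules in front of it:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("PORT 33334 in use:", port_in_use(33334))
```

Before starting the miner the result should be False for PORT; after a successful start it should be True.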

Expected Log Messages

Successful Startup:

Weather task ENABLED for this miner (WEATHER_MINER_ENABLED=True)
Weather routes registered (weather task is enabled)
Weather file serving mode: local
RunPod API Key loaded from INFERENCE_SERVICE_API_KEY env var
Miner started successfully on port 33334

Disabled Weather Task:

Weather task DISABLED for this miner. Set WEATHER_MINER_ENABLED=True to enable.
Weather routes NOT registered (weather task is disabled)

Migration from Old Configuration

If updating from older documentation:

# Backup existing config
cp .env .env.backup

# Fix variable names
sed -i 's/R2_BUCKET_NAME=/R2_BUCKET=/g' .env
sed -i 's/R2_ACCESS_KEY_ID=/R2_ACCESS_KEY=/g' .env
sed -i 's/INTERNAL_PORT=/PORT=/g' .env

# Add the file serving mode only if it is not already set
grep -q '^WEATHER_FILE_SERVING_MODE=' .env || echo "WEATHER_FILE_SERVING_MODE=local" >> .env

# Verify changes
grep -E "R2_BUCKET|R2_ACCESS_KEY|PORT=" .env

Getting Help

Check Logs: Review miner logs for specific error messages
Verify Configuration: Use verification commands above
Test Components: Ensure all services (PostgreSQL, inference service) are running
Network Connectivity: Test external service connectivity
Hardware Check: Verify GPU availability for local inference

Performance Monitoring

System Resources:

# GPU usage (for local weather inference)
nvidia-smi

# System memory
free -h

# Disk space (critical for weather task)
df -h

# Process monitoring
htop

Miner-Specific:

Monitor forecast directory size growth
Check GFS cache usage
Review database size
Track network bandwidth during GFS downloads

Security Notes

Credential Management

Never commit .env files to version control
Rotate keys regularly, especially R2 and API keys
Use strong passwords for database and JWT secrets
Limit R2 permissions to minimum required (read/write to specific bucket)

Network Security

Configure firewall to allow only necessary ports
Use HTTPS for all external communications
Monitor access logs for unusual activity
Keep systems updated with security patches

Best Practices

Regular backups of configuration and database
Monitor logs for security events
Test disaster recovery procedures
Document access procedures for team members

Summary

This guide provides everything needed to set up and run a Gaia miner:

Quick Setup: Basic configuration for immediate functionality
Weather Task: Comprehensive opt-in weather forecasting
Variable Names: Critical exact naming requirements
Task Details: Complete description of all supported tasks
Troubleshooting: Solutions for common issues

The key to success is using the exact variable names expected by the code and following the configuration templates provided. All documentation is now aligned with the current codebase to prevent configuration failures.

Advanced: Custom Models & Inference Service

Custom Models

Miners can create custom models for improved performance:

File Structure

gaia/models/custom_models/
├── custom_soil_model.py           # CustomSoilModel class
├── custom_geomagnetic_model.py    # CustomGeomagneticModel class
└── custom_weather_model.py        # CustomWeatherModel class (future)

Requirements

Exact class names: CustomSoilModel, CustomGeomagneticModel
Required method: run_inference() with specific input/output formats
Soil moisture output: 11x11 arrays for surface/rootzone (0-1 range)
Geomagnetic output: Next-hour DST prediction with UTC timestamp
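
A skeleton for the soil model might look like the sketch below. Only the class name, the run_inference method name, the 11x11 shape, and the 0-1 range come from the requirements above. The argument type, the dictionary keys "surface" and "rootzone", and the helper check_soil_output are assumptions, so check the task code for the real contract:

```python
GRID = 11  # the requirements specify 11x11 output arrays


class CustomSoilModel:
    """Skeleton using the class name required above."""

    def run_inference(self, inputs):
        # ASSUMPTION: the input type and the keys of the returned dict are
        # placeholders for illustration only.
        fill = 0.25  # constant placeholder prediction inside the 0-1 range
        surface = [[fill] * GRID for _ in range(GRID)]
        rootzone = [[fill] * GRID for _ in range(GRID)]
        return {"surface": surface, "rootzone": rootzone}


def check_soil_output(output) -> bool:
    """Verify both layers are 11x11 and every value lies in [0, 1]."""
    for layer in ("surface", "rootzone"):
        grid = output[layer]
        if len(grid) != GRID or any(len(row) != GRID for row in grid):
            return False
        if any(not 0.0 <= value <= 1.0 for row in grid for value in row):
            return False
    return True
```

Running a shape and range check like this in local tests catches malformed output before a validator does.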

Weather Inference Service Setup

The Weather Inference Service provides remote GPU-based inference for weather forecasting. This section covers complete setup from Docker building to cloud deployment.

Overview

The inference service is a FastAPI-based application that:

Receives weather data from miners via HTTP API
Runs Aurora model inference on GPU hardware
Uploads results to R2 storage for miner access
Supports both RunPod serverless and dedicated deployments

Architecture

Miner → HTTP Request → Inference Service → Aurora Model → R2 Upload → Response

Key Components:

FastAPI Server: Handles HTTP requests and authentication
Aurora Model: Microsoft's weather prediction model
R2 Storage: Cloudflare R2 for forecast data storage
Docker Container: Portable deployment environment

Prerequisites

Required Accounts & Services

Cloudflare R2 Storage

# Create R2 bucket for forecast storage
# Get R2 credentials: Account ID, Access Key, Secret Key

GPU Infrastructure (Choose one):
- RunPod: Serverless GPU platform (recommended)
- Vast.ai: GPU rental marketplace
- AWS/GCP/Azure: Cloud GPU instances
- Local GPU: NVIDIA GPU with 24GB+ VRAM

Docker Environment

# Install Docker and Docker Compose
sudo apt update
sudo apt install docker.io docker-compose
# Add your user to the docker group (log out and back in for it to take effect)
sudo usermod -aG docker "$USER"

Hardware Requirements

Minimum GPU Requirements:

VRAM: 24GB+ (RTX 3090, RTX 4090, A5000, A6000, H100)
CUDA: 11.8+ or 12.x
Memory: 32GB+ system RAM
Storage: 50GB+ for model and temporary files

Step 1: Configure the Inference Service

1.1 Configuration Files

Navigate to the inference service directory:

cd gaia/miner/inference_service

1.2 Edit Settings (`config/settings.yaml`)

model:
  # Aurora model configuration
  model_repo: "/app/models/aurora_local"  # Local path in container
  checkpoint: "aurora-0.25-pretrained.ckpt"
  device: "auto"
  inference_steps: 40
  forecast_step_hours: 6
  resolution: "0.25"

api:
  port: 8000
  host: "0.0.0.0"

logging:
  level: "INFO"

1.3 Environment Variables

Create .env file for local testing:

# API Authentication
INFERENCE_SERVICE_API_KEY=your_secure_api_key_here

# R2 Storage Configuration
R2_BUCKET=your-weather-forecasts-bucket
R2_ENDPOINT_URL=https://your-account-id.r2.cloudflarestorage.com
R2_ACCESS_KEY=your_r2_access_key
R2_SECRET_ACCESS_KEY=your_r2_secret_key

# Logging
LOG_LEVEL=INFO

# Optional: Custom model paths
# CUSTOM_MODEL_PATH=/app/local_models/custom_aurora

Step 2: Build Docker Image

2.1 Standard Build

cd gaia/miner/inference_service

# Build the Docker image
docker build -t weather-inference-service:latest .

# Verify build success
docker images | grep weather-inference-service

2.2 Build with Custom Aurora Model

If you have a custom Aurora model:

# 1. Create local model directory
mkdir -p local_models/custom_aurora

# 2. Copy your model files
cp /path/to/your/custom_model.ckpt local_models/custom_aurora/
cp /path/to/your/config.json local_models/custom_aurora/

# 3. Uncomment the COPY line in Dockerfile
sed -i 's/# COPY \.\/local_models/COPY \.\/local_models/' Dockerfile

# 4. Update settings.yaml to point to custom model
sed -i 's|model_repo: "/app/models/aurora_local"|model_repo: "/app/local_models/custom_aurora"|' config/settings.yaml

# 5. Build with custom model
docker build -t weather-inference-service:custom .

2.3 Build Arguments

Customize the build process:

# Use different Aurora model version
docker build \
  --build-arg AURORA_MODEL_REPO="microsoft/aurora" \
  --build-arg AURORA_CHECKPOINT_NAME="aurora-0.25-pretrained.ckpt" \
  -t weather-inference-service:latest .

# Build for specific CUDA version
docker build \
  --build-arg CUDA_VERSION="11.8" \
  -t weather-inference-service:cuda118 .

Step 3: Local Testing

3.1 Run Locally with GPU

# Run with GPU support
docker run --gpus all \
  -p 8000:8000 \
  --env-file .env \
  weather-inference-service:latest

# Run with specific GPU
docker run --gpus '"device=0"' \
  -p 8000:8000 \
  --env-file .env \
  weather-inference-service:latest

3.2 Test Health Endpoint

# Check service health
curl http://localhost:8000/health

# Expected response:
{
  "status": "ok",
  "model_status": "loaded"
}
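
When scripting against the health endpoint, checking the HTTP status alone is not enough, because the service can respond before the model has finished loading. A minimal sketch that interprets the response shown above (the function name service_ready is hypothetical):

```python
import json


def service_ready(status_code: int, body: str) -> bool:
    """True when the health response reports status "ok" and a loaded model."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok" and payload.get("model_status") == "loaded"
```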

3.3 Test Inference Endpoint

# Test with sample data (requires valid API key)
curl -X POST http://localhost:8000/run_inference \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your_secure_api_key_here" \
  -d '{
    "input": {
      "action": "run_inference_from_r2",
      "input_r2_object_key": "test/sample_input.pkl",
      "job_run_uuid": "test-job-123"
    }
  }'
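
The same request can be built from Python with the standard library. The sketch below mirrors the curl example (same path, headers, and payload fields) and only constructs the request without sending it. The function name build_inference_request is hypothetical:

```python
import json
import urllib.request


def build_inference_request(base_url, api_key, r2_object_key, job_run_uuid):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "input": {
            "action": "run_inference_from_r2",
            "input_r2_object_key": r2_object_key,
            "job_run_uuid": job_run_uuid,
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/run_inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen with a generous timeout, since inference can take a while.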

Step 4: RunPod Deployment

4.1 Push to Container Registry

Option A: Docker Hub

# Tag and push to Docker Hub
docker tag weather-inference-service:latest yourusername/weather-inference:latest
docker push yourusername/weather-inference:latest

Option B: GitHub Container Registry

# Log in to GitHub Container Registry
echo "$GITHUB_TOKEN" | docker login ghcr.io -u yourusername --password-stdin

# Tag and push
docker tag weather-inference-service:latest ghcr.io/yourusername/weather-inference:latest
docker push ghcr.io/yourusername/weather-inference:latest

4.2 RunPod Serverless Setup

Create RunPod Account
- Sign up at runpod.io
- Add payment method and credits

Create Network Volume (Required)

# Navigate to Storage → Network Volumes → Create Volume

# Configuration:
Name: weather-data-volume
Size: 50GB (minimum required for input/output files)
Region: [same as your endpoint region]

Create Serverless Endpoint

# Navigate to Serverless → Endpoints → Create Endpoint

# Configuration:
Name: weather-inference-service
Container Image: yourusername/weather-inference:latest
Container Registry Credentials: [if private registry]
Network Volume: weather-data-volume → /workspace/data

Configure Environment Variables

# In RunPod Endpoint Settings → Environment Variables:
INFERENCE_SERVICE_API_KEY=your_secure_api_key_here
R2_BUCKET=your-weather-forecasts-bucket
R2_ENDPOINT_URL=https://your-account-id.r2.cloudflarestorage.com
R2_ACCESS_KEY=your_r2_access_key
R2_SECRET_ACCESS_KEY=your_r2_secret_key
LOG_LEVEL=INFO

GPU Configuration

# Recommended GPU types:
- RTX 3090 (24GB VRAM) - Cost effective
- RTX 4090 (24GB VRAM) - Faster inference
- A5000 (24GB VRAM) - Professional grade
- H100 (80GB VRAM) - Highest performance

# Container Configuration:
Container Disk: 50GB
Network Volume: 50GB (REQUIRED - for input/output file storage)

Deploy and Test

# Deploy the endpoint
# Copy the endpoint URL (e.g., https://api.runpod.ai/v2/your-endpoint-id)

# Test deployment
curl https://api.runpod.ai/v2/your-endpoint-id/health

4.3 RunPod Pod (Dedicated Instance)

For consistent availability, use a dedicated pod:

# Create Pod Configuration:
Template: Custom
Container Image: yourusername/weather-inference:latest
GPU: RTX 3090/4090 (24GB VRAM minimum)
Container Disk: 50GB
Volume Disk: 100GB
Ports: 8000 (HTTP)

# Environment Variables: [same as serverless]

# Startup Command:
python -u -m app.main

Step 5: Miner Configuration

5.1 Update Miner Environment

# In your miner .env file:
WEATHER_MINER_ENABLED=True
WEATHER_INFERENCE_TYPE=http_service

# RunPod Serverless Endpoint
WEATHER_INFERENCE_SERVICE_URL="https://api.runpod.ai/v2/your-endpoint-id/run"

# OR RunPod Pod (dedicated instance)
WEATHER_INFERENCE_SERVICE_URL="https://your-pod-id-8000.proxy.runpod.net/run_inference"

# API Key (must match inference service)
INFERENCE_SERVICE_API_KEY=your_secure_api_key_here

# R2 Configuration (for miner)
R2_ENDPOINT_URL=https://your-account-id.r2.cloudflarestorage.com
R2_BUCKET=your-weather-forecasts-bucket
R2_ACCESS_KEY=your_r2_access_key
R2_SECRET_ACCESS_KEY=your_r2_secret_key

5.2 Test Miner Integration

# Restart miner
cd gaia/miner
python miner.py

# Check logs for successful connection
tail -f logs/miner.log | grep -i "weather\|inference\|runpod"

# Expected log messages:
# "Weather task ENABLED for this miner"
# "RunPod API Key loaded from INFERENCE_SERVICE_API_KEY"
# "HTTP Inference Service URL is present"

Step 6: Monitoring & Maintenance

6.1 Health Monitoring

Automated Health Checks:

#!/bin/bash
# health_check.sh
ENDPOINT_URL="https://api.runpod.ai/v2/your-endpoint-id"
API_KEY="your_secure_api_key_here"

response=$(curl -s -w "%{http_code}" -o /tmp/health_response.json \
  -H "X-API-Key: $API_KEY" \
  "$ENDPOINT_URL/health")

if [ "$response" = "200" ]; then
  echo "✅ Inference service healthy"
else
  echo "❌ Inference service unhealthy (HTTP $response)"
  cat /tmp/health_response.json
fi

RunPod Monitoring:

# Monitor RunPod usage and costs
# Check endpoint logs in RunPod dashboard
# Set up billing alerts

6.2 Log Analysis

Common Log Patterns:

# Successful inference
"Successfully processed inference request"
"Uploaded forecast to R2"

# Authentication issues
"Invalid API key"
"Authentication failed"

# Model issues
"Model loading failed"
"CUDA out of memory"

# R2 issues
"R2 upload failed"
"R2 connection timeout"

6.3 Performance Optimization

GPU Optimization:

# In settings.yaml
model:
  device: "cuda"  # Force CUDA instead of auto
  batch_size: 1   # Adjust based on VRAM
  precision: "fp16"  # Use half precision if supported

R2 Optimization:

# Concurrent upload limits
r2:
  max_concurrent_uploads: 10
  upload_timeout_seconds: 300
  retry_attempts: 3
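
To illustrate what a retry_attempts setting typically means, here is a generic retry helper with exponential backoff. It is a sketch, not the service's upload code: the function name with_retries, the doubling delay, and the choice to retry on any exception are assumptions:

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds and retry.

    Re-raises the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The sleep parameter is injectable so the helper can be tested without real waiting.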

Step 7: Troubleshooting

7.1 Common Issues

Docker Build Failures:

# Issue: CUDA compatibility
# Solution: Use an NVIDIA CUDA base image (tags carry the full version, e.g. 11.8.0)
FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04

# Issue: Model download timeout
# Solution: Increase timeout or pre-download
RUN timeout 1800 python -c "from huggingface_hub import snapshot_download; ..."

RunPod Deployment Issues:

# Issue: Container won't start
# Check: Environment variables are set correctly
# Check: Container image is accessible
# Check: GPU requirements are met
# Check: Network volume is attached (required for serverless)

# Issue: Network volume not accessible
# Solution: Ensure 50GB network volume is created and attached
# Check: Volume mount path is /workspace/data
# Check: Volume is in same region as endpoint

# Issue: Out of memory
# Solution: Use larger GPU or optimize model
# Check: nvidia-smi output in container logs

API Connection Issues:

# Issue: Authentication failed
# Check: API key matches between miner and service
# Check: API key environment variable name

# Issue: Connection timeout
# Check: RunPod endpoint URL is correct
# Check: Network connectivity from miner

R2 Storage Issues:

# Issue: R2 upload failed
# Check: R2 credentials and permissions
# Check: Bucket exists and is accessible
# Test: Manual R2 connection with AWS CLI

# Test R2 connection:
aws s3 ls s3://your-bucket --endpoint-url=https://your-account-id.r2.cloudflarestorage.com

7.2 Debug Commands

Container Debugging:

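# Run container interactively (pass --env-file so the environment check below is meaningful)
docker run -it --gpus all --env-file .env --entrypoint /bin/bash weather-inference-service:latest

# The remaining commands run inside the container shell

# Check GPU availability
nvidia-smi

# Check that PyTorch can see the GPU
python -c "import torch; print(torch.cuda.is_available())"

# Check environment variables
env | grep -E "R2_|INFERENCE_"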

Network Testing:

# Test endpoint connectivity
curl -v https://api.runpod.ai/v2/your-endpoint-id/health

# Test with authentication
curl -H "X-API-Key: your-key" https://api.runpod.ai/v2/your-endpoint-id/health

# Check DNS resolution
nslookup api.runpod.ai

7.3 Performance Monitoring

GPU Monitoring:

# Monitor GPU usage during inference
watch -n 1 nvidia-smi

# Check memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Cost Monitoring:

# RunPod cost tracking
# Monitor usage in RunPod dashboard
# Set up billing alerts
# Track inference requests per hour

Step 8: Advanced Configuration

8.1 Custom Aurora Models

Preparing Custom Models:

# 1. Train or fine-tune Aurora model
# 2. Save checkpoint in compatible format
# 3. Create model configuration

# Directory structure:
local_models/custom_aurora/
├── custom_model.ckpt
├── config.json
└── metadata.json

Docker Integration:

# Add to Dockerfile
COPY ./local_models/custom_aurora /app/local_models/custom_aurora

# Update settings.yaml
model:
  model_repo: "/app/local_models/custom_aurora"
  checkpoint: "custom_model.ckpt"

8.2 Multi-GPU Setup

For Multiple GPUs:

# settings.yaml
model:
  device: "cuda:0"  # Specify GPU
  multi_gpu: true
  gpu_ids: [0, 1]   # Use multiple GPUs

Docker Configuration:

# Run with multiple GPUs
docker run --gpus '"device=0,1"' \
  -p 8000:8000 \
  weather-inference-service:latest

8.3 Scaling & Load Balancing

Multiple Endpoints:

# Deploy multiple RunPod endpoints
# Use load balancer or round-robin in miner
WEATHER_INFERENCE_SERVICE_URL="https://api.runpod.ai/v2/endpoint-1/run,https://api.runpod.ai/v2/endpoint-2/run"
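
If round-robin is implemented on the miner side, the comma-separated value can be split and cycled as sketched below. This guide does not confirm that the miner parses a list in this variable, so treat both the parsing and the function name endpoint_cycle as assumptions:

```python
import itertools


def endpoint_cycle(raw_value: str):
    """Split a comma-separated URL list and return an endless round-robin iterator."""
    urls = [part.strip() for part in raw_value.strip('"').split(",") if part.strip()]
    if not urls:
        raise ValueError("no endpoint URLs configured")
    return itertools.cycle(urls)
```

Usage: create the iterator once at startup and call next() on it for each inference request.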

Auto-scaling:

# Configure RunPod auto-scaling
# Set min/max workers
# Configure scale-up/down policies

Step 9: Security Best Practices

9.1 API Key Management

# Generate secure API keys
python -c "import secrets; print(secrets.token_urlsafe(32))"

# Rotate keys regularly
# Use different keys for different environments
# Store keys securely (not in code)

9.2 Network Security

# Restrict access to inference service
# Use VPN or private networks when possible
# Monitor access logs
# Implement rate limiting

9.3 Container Security

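# Use minimal base images
# Scan for vulnerabilities (docker scan has been retired in recent Docker releases; Docker Scout replaces it)
docker scout cves weather-inference-service:latest

# Run as non-root user (Dockerfile instruction)
USER 1000:1000

# Limit container capabilities (docker run flags)
--cap-drop=ALL --cap-add=SYS_NICE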

Step 10: Cost Optimization

10.1 RunPod Cost Management

Serverless vs Dedicated:

# Serverless: Pay per inference
# - Good for: Variable workload
# - Cost: $0.50-2.00 per hour of GPU time

# Dedicated Pod: Fixed hourly rate
# - Good for: Consistent workload
# - Cost: $0.30-1.50 per hour continuous

GPU Selection:

# Cost-effective options:
RTX 3090: ~$0.30/hour (24GB VRAM)
RTX 4090: ~$0.50/hour (24GB VRAM, faster)
A5000: ~$0.70/hour (24GB VRAM, professional)

# High-performance options:
A6000: ~$1.00/hour (48GB VRAM)
H100: ~$2.00/hour (80GB VRAM, fastest)
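
The serverless versus dedicated choice comes down to how many GPU hours are used per day. The sketch below works through that arithmetic with the approximate rates quoted above; the function names are hypothetical, and real prices vary by provider and over time:

```python
def monthly_cost(rate_per_hour: float, hours_per_day: float, days: int = 30) -> float:
    """Cost in USD of running hours_per_day at rate_per_hour for the given days."""
    return round(rate_per_hour * hours_per_day * days, 2)


def break_even_hours_per_day(serverless_rate: float, dedicated_rate: float) -> float:
    """Daily serverless GPU hours above which a 24/7 dedicated pod is cheaper."""
    return 24 * dedicated_rate / serverless_rate


if __name__ == "__main__":
    print("Dedicated RTX 3090, 24 h/day:", monthly_cost(0.30, 24))
    print("Serverless at $0.50/h, 4 h/day:", monthly_cost(0.50, 4))
    print("Break-even hours/day:", break_even_hours_per_day(0.50, 0.30))
```

With these example rates, a dedicated pod at $0.30/hour costs about $216 per month, while serverless at $0.50/hour for 4 hours a day costs about $60, and the break-even point is roughly 14.4 GPU hours per day.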

10.2 Storage Optimization

R2 Storage Costs:

# Storage: $0.015/GB/month
# Requests: $0.36/million Class B (read) operations; Class A (write) operations are billed at a higher rate
# Egress: Free (major advantage over S3)

# Optimization strategies:
# - Compress forecast data
# - Implement data lifecycle policies
# - Clean up old forecasts automatically
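
For the local forecast directory, automatic cleanup can start with a dry run that only lists what is old enough to delete. The sketch below does that and removes nothing. The function name stale_entries and the use of modification time as the age criterion are assumptions; R2-side cleanup would be handled separately, for example through bucket lifecycle rules:

```python
import os
import time


def stale_entries(directory: str, max_age_days: float, now=None):
    """Return paths directly under directory last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(directory):
        if entry.stat().st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)
```

Review the returned list before wiring it to any deletion step, since a forecast that validators are still fetching should not be removed.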

10.3 Monitoring & Alerts

# Set up cost alerts
# Monitor inference frequency
# Track storage usage
# Optimize based on usage patterns

This comprehensive guide covers everything needed to set up a production-ready weather inference service, from initial Docker build to advanced scaling and optimization strategies.
