caramba supports multiple training paradigms, from simple end-to-end training to sophisticated architecture surgery with distillation. This guide covers all training modes and their configurations.
- Overview
- Standard Training
- Upcycle Training
- Orchestrated Training
- Training Configuration
- Distributed Training
## Overview

caramba provides three training modes:
| Mode | Trainer | Use Case |
|---|---|---|
| Standard | `trainer.standard` | Training from scratch or fine-tuning |
| Upcycle | `trainer.upcycle` | Architecture surgery + distillation |
| Orchestrated | Built into runs | Dynamic optimizer switching |
Each mode is selected by the `trainer` field and configured in the `train` block:
```yaml
targets:
  - type: experiment
    trainer: trainer.standard   # or trainer.upcycle
    runs:
      - id: train
        train:
          phase: standard
          orchestrator_enabled: false   # Enable for orchestrated mode
```

## Standard Training

End-to-end training from scratch or fine-tuning an existing model.
```yaml
trainer: trainer.standard
runs:
  - id: train
    mode: train
    steps: 10000
    train:
      phase: standard
      batch_size: 32
      block_size: 512
      lr: 0.0003
      device: mps
      dtype: float32
```

Commonly used `train` settings:

```yaml
train:
  phase: standard

  # Core settings
  batch_size: 32               # Training batch size
  block_size: 512              # Sequence length
  lr: 0.0003                   # Learning rate
  device: mps                  # Device: mps, cuda, cpu
  dtype: float32               # Data type: float32, float16, bfloat16

  # Optimizer settings
  weight_decay: 0.01           # Weight decay (AdamW)
  beta1: 0.9                   # Adam beta1
  beta2: 0.95                  # Adam beta2
  grad_clip: 1.0               # Gradient clipping norm

  # Learning rate schedule
  warmup_steps: 100            # LR warmup steps
  lr_schedule: cosine          # cosine, linear, constant
  min_lr: 0.00001              # Minimum LR for schedule

  # Optimization features
  use_amp: false               # Automatic mixed precision
  amp_dtype: float16           # AMP dtype
  compile_model: false         # torch.compile optimization
  gradient_accumulation_steps: 1

  # Data loading
  num_workers: 4               # DataLoader workers
  pin_memory: true             # Pin memory for GPU transfer
```
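To make the schedule knobs concrete, here is a minimal sketch of the usual warmup-plus-cosine interpretation of `lr`, `warmup_steps`, `min_lr`, and `lr_schedule: cosine`. It illustrates the standard formula, not necessarily caramba's exact implementation:

```python
import math

def lr_at(step: int, lr: float = 3e-4, warmup_steps: int = 100,
          min_lr: float = 1e-5, total_steps: int = 10_000) -> float:
    """Linear warmup to `lr`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps               # ramp up from ~0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))     # 1 -> 0 over training
    return min_lr + (lr - min_lr) * cosine
```

Note also that `gradient_accumulation_steps` multiplies the effective batch size: `batch_size: 32` with 4 accumulation steps behaves like a batch of 128 per optimizer step.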
A complete from-scratch example:

```yaml
version: 2
name: train_from_scratch
vars:
  d_model: 256
  n_heads: 4
  n_layers: 4

targets:
  - type: experiment
    name: baseline
    trainer: trainer.standard
    runs:
      - id: train
        mode: train
        steps: 5000
        train:
          phase: standard
          batch_size: 16
          block_size: 256
          lr: 0.001
          device: mps
          dtype: float32
          warmup_steps: 200
          lr_schedule: cosine
```

Run it:
```bash
python3 -m caramba my_training.yml
```

## Upcycle Training

Architecture surgery that converts a pretrained model to a new architecture while preserving learned representations. This is caramba's flagship feature for attention surgery research.
```text
┌──────────────────────────────────────────────────────────┐
│                     UPCYCLE PIPELINE                     │
├──────────────────────────────────────────────────────────┤
│  1. Load Teacher   Load pretrained checkpoint            │
│  2. Build Student  Create target architecture            │
│  3. Surgery        SVD-based weight initialization       │
│  4. Blockwise      Layer-by-layer distillation           │
│  5. Global         End-to-end fine-tuning                │
│  6. Verify         Compare teacher/student outputs       │
│  7. Benchmark      Measure quality/speed/memory          │
└──────────────────────────────────────────────────────────┘
```
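Step 3 replaces full-rank teacher projections with the student's bottleneck factors. A minimal sketch of the SVD idea, with hypothetical shapes and names (caramba's actual surgery code is more involved):

```python
import torch

def svd_factor(teacher_w: torch.Tensor, rank: int):
    """Split a (d_out, d_in) teacher matrix into down/up projections of width `rank`.

    The truncated SVD is the best rank-`rank` approximation in the Frobenius
    norm, so the student's bottleneck starts as close to the teacher as possible.
    """
    U, S, Vh = torch.linalg.svd(teacher_w, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    down = sqrt_s[:, None] * Vh[:rank]   # (rank, d_in): compress the input
    up = U[:, :rank] * sqrt_s[None, :]   # (d_out, rank): expand back out
    return down, up

# e.g. seed a 256-dim geometric bottleneck from a 2048x2048 teacher projection
down, up = svd_factor(torch.randn(2048, 2048), rank=256)
```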
The pipeline is configured as two runs:

```yaml
trainer: trainer.upcycle
runs:
  # Phase 1: Blockwise distillation
  - id: blockwise
    mode: train
    steps: 500
    train:
      phase: blockwise
      teacher_ckpt: hf://meta-llama/Llama-3.2-1B
      batch_size: 1
      block_size: 2048
      lr: 0.0001
      device: mps
      dtype: float32

  # Phase 2: Global fine-tuning
  - id: finetune
    mode: train
    steps: 2000
    train:
      phase: global
      batch_size: 1
      block_size: 2048
      lr: 0.00005
      device: mps
      dtype: float32
```

The blockwise phase trains each layer to match teacher outputs:
```yaml
train:
  phase: blockwise
  teacher_ckpt: hf://meta-llama/Llama-3.2-1B

  # Convergence-based training (optional)
  convergence_target: 0.02      # Target L1 loss
  convergence_patience: 100     # Steps without improvement
  convergence_max_steps: 2000   # Max steps per block

  # Teacher output caching
  cache_teacher_outputs: true   # Cache for speed
```
By default, blockwise uses a fixed LR and (optionally) convergence-based early stopping. If you want it to self-tune and surface those decisions in console logs, enable the lightweight blockwise autotuner:

```yaml
train:
  phase: blockwise
  blockwise_autotune_enabled: true
  blockwise_autotune_mode: monitor   # monitor|active

  # Tuning knobs (optional)
  blockwise_autotune_plateau_patience: 100
  blockwise_autotune_lr_decay: 0.5
  blockwise_autotune_min_lr: 1e-6
  blockwise_autotune_log_every: 50
```

- `monitor`: logs spikes/plateaus and what it would do (no LR changes)
- `active`: reduces LR on spikes/plateaus within each block (see the sketch below)
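A minimal sketch of the kind of rule `active` mode applies; the names and the spike test here are illustrative, not the autotuner's actual internals:

```python
def adjust_lr(lr: float, loss: float, best_loss: float, steps_since_best: int,
              plateau_patience: int = 100, lr_decay: float = 0.5,
              min_lr: float = 1e-6, spike_factor: float = 2.0) -> float:
    """Decay LR on a loss spike or plateau, never below min_lr."""
    spiked = loss > spike_factor * best_loss          # sudden loss blow-up
    plateaued = steps_since_best >= plateau_patience  # no recent improvement
    return max(min_lr, lr * lr_decay) if (spiked or plateaued) else lr
```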
The blockwise phase:
- Iterates through each transformer block
- Runs teacher forward to get target outputs
- Trains student block to minimize L1 distance (see the sketch after this list)
- Optionally uses convergence-based stopping
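A hedged sketch of that per-block loop, including the convergence knobs from above; module and buffer names are illustrative:

```python
import torch
import torch.nn.functional as F

def distill_block(student_block, batches, *, lr=1e-4, target=0.02,
                  patience=100, max_steps=2000):
    """Train one student block to match cached (input, teacher_output) pairs."""
    opt = torch.optim.AdamW(student_block.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for step in range(max_steps):
        x, teacher_y = batches[step % len(batches)]
        loss = F.l1_loss(student_block(x), teacher_y)   # L1 distance to teacher
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < best:
            best, stale = loss.item(), 0
        else:
            stale += 1
        if best <= target or stale >= patience:         # convergence-based stop
            break
    return best
```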
The global phase fine-tunes the entire model end-to-end:
```yaml
train:
  phase: global
  lr: 0.00005   # Lower LR for fine-tuning
```

The global phase:
- Unfreezes all parameters
- Trains on next-token prediction
- Uses cross-entropy loss (a minimal sketch follows)
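Concretely, that loss is the usual shifted next-token cross-entropy; a minimal sketch:

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """logits: (B, T, V) model outputs; tokens: (B, T) input ids."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at positions 0..T-2
        tokens[:, 1:].reshape(-1),                    # targets are the following tokens
    )
```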
Converting Llama 3.2 1B to Decoupled Bottleneck Attention:
```yaml
version: 2
name: llama_to_dba

vars:
  d_model: 2048
  n_heads: 32
  n_kv_heads: 8
  sem_dim: 128    # Semantic bottleneck
  geo_dim: 256    # Geometric bottleneck

targets:
  - type: experiment
    name: upcycle
    trainer: trainer.upcycle
    system:
      ref: system.language_model
      config:
        model:
          type: TransformerModel
          topology:
            type: StackedTopology
            layers:
              - type: NestedTopology
                repeat: 16
                layers:
                  - type: ResidualTopology
                    layers:
                      - type: RMSNormLayer
                        d_model: ${d_model}
                      - type: AttentionLayer
                        d_model: ${d_model}
                        n_heads: ${n_heads}
                        n_kv_heads: ${n_kv_heads}
                        mode: decoupled    # DBA mode
                        sem_dim: ${sem_dim}
                        geo_dim: ${geo_dim}
                        rope_enabled: true
                  # ... FFN blocks ...
    runs:
      - id: blockwise
        train:
          phase: blockwise
          teacher_ckpt: hf://meta-llama/Llama-3.2-1B
          convergence_target: 0.02
      - id: finetune
        train:
          phase: global
          lr: 0.00005
```

Attach verification to check quality after training:
```yaml
runs:
  - id: blockwise
    verify:
      type: compare
      batches: 5
      attention:
        max_mean_l1: 0.05
        max_max_l1: 0.25
      logits:
        max_mean_l1: 0.05
        max_max_l1: 0.25
```
max_max_l1: 0.25See Manifests → Verification for all options.
## Orchestrated Training

Dynamic optimizer switching based on training telemetry. The orchestrator monitors loss, gradients, and training phase to select the best optimization strategy.
Different training phases benefit from different strategies:
| Phase | Challenge | Strategy |
|---|---|---|
| Early | High gradients | Conservative clipping |
| Plateau | Slow progress | Momentum boost |
| Late | Overfitting | SGD for generalization |
| Spike | Loss explosion | Safety rollback |
```yaml
train:
  phase: global
  orchestrator_enabled: true
  orchestrator_decision_interval: 500
  orchestrator_eval_horizon: 100
  orchestrator_initial_strategy: conservative_adamw
  orchestrator_use_adagc: true
```

All orchestrator options:

```yaml
train:
  # Enable orchestration
  orchestrator_enabled: true

  # Decision timing
  orchestrator_decision_interval: 500   # Steps between decisions
  orchestrator_eval_horizon: 100        # Steps to evaluate each strategy

  # Initial strategy
  orchestrator_initial_strategy: conservative_adamw

  # Strategy components
  orchestrator_use_adagc: true          # Adaptive gradient clipping
  orchestrator_use_nowcasting: false    # Weight trajectory prediction

  # Safety
  orchestrator_max_loss_increase: 1.5   # Rollback threshold
  orchestrator_safety_strategy: spike_resistant
```

Available strategies:

| Strategy | Description |
|---|---|
| `conservative_adamw` | Safe defaults, moderate LR, global clipping |
| `aggressive_adamw` | Higher LR, less clipping, faster convergence |
| `sgd_escape` | SGD with momentum for escaping sharp minima |
| `spike_resistant` | Low LR, aggressive clipping for unstable phases |
The orchestrator includes SWATS, which automatically switches from Adam to SGD when training stabilizes:
```python
# Programmatic usage
from caramba.orchestrator import SWATS, SWATSConfig

optimizer = SWATS(
    model.parameters(),
    config=SWATSConfig(
        adam_lr=1e-3,
        switch_threshold=1e-9,
        min_steps_before_switch=1000,
    ),
)
```

AdaGC applies per-parameter clipping that adapts to each parameter's gradient distribution:
```yaml
train:
  orchestrator_use_adagc: true
  orchestrator_adagc_warmup: 100
  orchestrator_adagc_threshold: 3.0
```
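The idea behind adaptive per-parameter clipping, as a rough sketch. This is not caramba's AdaGC code; the EMA rule and epsilon are illustrative:

```python
import torch

def adagc_clip(params, ema, step, *, warmup=100, threshold=3.0, beta=0.99):
    """Clip each parameter's grad to threshold x its own running grad-norm."""
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        norm = p.grad.norm().item()
        if step >= warmup and norm > threshold * ema[i]:
            p.grad.mul_(threshold * ema[i] / (norm + 1e-12))  # rescale outlier grads
        ema[i] = beta * ema[i] + (1 - beta) * norm            # update running norm

# usage: ema = [0.0] * len(list(model.parameters())), call once per step
```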
How the pieces fit together:

```text
┌────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR                        │
├────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Telemetry   │───▶│   Decision   │───▶│   Strategy   │  │
│  │    Stream    │    │   Boundary   │    │    Switch    │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                   │                   │          │
│         ▼                   ▼                   ▼          │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │    Spike     │    │     UCB      │    │ Speculative  │  │
│  │   Detector   │    │    Bandit    │    │  Branching   │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
└────────────────────────────────────────────────────────────┘
```
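At each decision boundary, a bandit of this flavor balances strategies that have paid off against ones it has barely tried. A toy UCB1 sketch over the strategy names above; the real orchestrator's reward signal and bookkeeping differ:

```python
import math

def pick_strategy(history: dict[str, list[float]], c: float = 1.4) -> str:
    """UCB1: mean reward plus an exploration bonus for rarely-tried arms."""
    total = sum(len(v) for v in history.values()) or 1
    def ucb(name: str) -> float:
        rewards = history[name]
        if not rewards:
            return float("inf")   # always try an untested strategy first
        return sum(rewards) / len(rewards) + c * math.sqrt(math.log(total) / len(rewards))
    return max(history, key=ucb)

history = {"conservative_adamw": [0.8, 0.7], "aggressive_adamw": [0.9],
           "sgd_escape": [], "spike_resistant": [0.4]}
print(pick_strategy(history))   # -> "sgd_escape" (untried, so explored first)
```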
## Training Configuration

The full set of `train` options:

```yaml
train:
  # === Core Settings ===
  phase: standard             # standard, blockwise, global
  batch_size: 32
  block_size: 512
  lr: 0.0003
  device: mps                 # mps, cuda, cpu
  dtype: float32              # float32, float16, bfloat16

  # === Upcycle Settings ===
  teacher_ckpt: null          # HF path or local checkpoint
  cache_teacher_outputs: false

  # === Convergence Settings ===
  convergence_target: null    # Target loss for early stopping
  convergence_patience: 100   # Steps without improvement
  convergence_max_steps: null

  # === Optimizer Settings ===
  weight_decay: 0.01
  beta1: 0.9
  beta2: 0.95
  grad_clip: 1.0

  # === LR Schedule ===
  warmup_steps: 0
  lr_schedule: cosine         # cosine, linear, constant, none
  min_lr: 0.0

  # === Mixed Precision ===
  use_amp: false
  amp_dtype: float16

  # === Optimization ===
  compile_model: false
  gradient_accumulation_steps: 1
  activation_checkpointing: false
  activation_checkpoint_threshold: null

  # === Data Loading ===
  num_workers: 0
  pin_memory: false

  # === Orchestrator ===
  orchestrator_enabled: false
  orchestrator_decision_interval: 500
  orchestrator_eval_horizon: 100
  orchestrator_initial_strategy: conservative_adamw
  orchestrator_use_adagc: false
  orchestrator_use_nowcasting: false
```

Device selection:

| Device | When to Use |
|---|---|
| `mps` | Apple Silicon (M1/M2/M3/M4) |
| `cuda` | NVIDIA GPUs |
| `cpu` | Development/testing only |
| dtype | Precision | Memory | Speed | Use Case |
|---|---|---|---|---|
| `float32` | Full | High | Baseline | Training stability |
| `float16` | Half | Low | Fast | Inference, AMP |
| `bfloat16` | Brain float | Low | Fast | Training on Ampere+ |
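For reference, `use_amp: true` with `amp_dtype: float16` corresponds to the standard PyTorch autocast-plus-GradScaler pattern sketched below. The trainer handles this wiring internally; the sketch only clarifies the settings (bfloat16 typically needs no scaler):

```python
import torch
from torch import nn

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling guards against fp16 underflow

for _ in range(10):
    x = torch.randn(32, 16, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # forward pass runs in half precision
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales, skips step on inf/nan grads
    scaler.update()
```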
## Distributed Training

Scale training to multiple GPUs with DDP or FSDP.
Use DDP for models that fit on a single GPU:
```python
from caramba.trainer import DistributedConfig, DistributedStrategy

dist_config = DistributedConfig(
    strategy=DistributedStrategy.DDP,
    ddp_find_unused_parameters=False,
)
```

Launch:
```bash
torchrun --nproc_per_node=4 train.py
```

Use FSDP for models that don't fit on a single GPU:
```python
dist_config = DistributedConfig(
    strategy=DistributedStrategy.FSDP,
    fsdp_sharding_strategy="FULL_SHARD",
    fsdp_mixed_precision=True,
    fsdp_activation_checkpointing=True,
    fsdp_transformer_layer_cls=["TransformerBlock"],
)
```

Rank-aware helpers are available for multi-process code:

```python
from caramba.trainer.distributed import (
    is_distributed,
    get_rank,
    get_world_size,
    is_main_process,
)

if is_main_process():
    print(f"Training on {get_world_size()} GPUs")
```

Use the `quick` target for fast iteration:
```bash
python3 -m caramba config/presets/llama32_1b_dba.yml --target quick
```

This runs with:
- Reduced steps (50 blockwise, 100 global)
- Smaller block size (512)
- Minimal benchmarks
Use the `paper` target for publication-quality experiments:
```bash
python3 -m caramba config/presets/llama32_1b_dba.yml --target paper
```

This runs with:
- Full training (500 blockwise, 2000 global)
- Full block size (2048)
- Complete benchmarks (perplexity, latency, memory)
- Artifact generation (CSV, PNG, LaTeX)
In summary:

| Mode | Trainer | Phases | Use Case |
|---|---|---|---|
| Standard | `trainer.standard` | standard | From-scratch training |
| Upcycle | `trainer.upcycle` | blockwise → global | Architecture surgery |
| Orchestrated | Any + flags | Any | Adaptive optimization |