caramba supports multiple training paradigms, from simple end-to-end training to sophisticated architecture surgery with distillation. This guide covers all training modes and their configurations.
- Overview
- Standard Training
- Upcycle Training
- Orchestrated Training
- Training Configuration
- Distributed Training
## Overview

caramba provides three training modes:
| Mode | Trainer | Use Case |
|---|---|---|
| Standard | `trainer.standard` | Training from scratch or fine-tuning |
| Upcycle | `trainer.upcycle` | Architecture surgery + distillation |
| Orchestrated | Built into runs | Dynamic optimizer switching |
Each mode is selected by the `trainer` field and configured in the `train` block:
```yaml
targets:
  - type: experiment
    trainer: trainer.standard   # or trainer.upcycle
    runs:
      - id: train
        train:
          phase: standard
          orchestrator_enabled: false   # Enable for orchestrated mode
```

## Standard Training

End-to-end training from scratch or fine-tuning an existing model.
```yaml
trainer: trainer.standard
runs:
  - id: train
    mode: train
    steps: 10000
    train:
      phase: standard
      batch_size: 32
      block_size: 512
      lr: 0.0003
      device: mps
      dtype: float32
```

Commonly used `train` settings:

```yaml
train:
  phase: standard

  # Core settings
  batch_size: 32               # Training batch size
  block_size: 512              # Sequence length
  lr: 0.0003                   # Learning rate
  device: mps                  # Device: mps, cuda, cpu
  dtype: float32               # Data type: float32, float16, bfloat16

  # Optimizer settings
  weight_decay: 0.01           # Weight decay (AdamW)
  beta1: 0.9                   # Adam beta1
  beta2: 0.95                  # Adam beta2
  grad_clip: 1.0               # Gradient clipping norm

  # Learning rate schedule
  warmup_steps: 100            # LR warmup steps
  lr_schedule: cosine          # cosine, linear, constant
  min_lr: 0.00001              # Minimum LR for schedule

  # Optimization features
  use_amp: false               # Automatic mixed precision
  amp_dtype: float16           # AMP dtype
  compile_model: false         # torch.compile optimization
  gradient_accumulation_steps: 1

  # Data loading
  num_workers: 4               # DataLoader workers
  pin_memory: true             # Pin memory for GPU transfer
```
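To make the schedule knobs concrete, here is a minimal sketch of the usual warmup-plus-cosine interpretation of `lr`, `warmup_steps`, `min_lr`, and `lr_schedule: cosine`. It illustrates the standard formula, not necessarily caramba's exact implementation:

```python
import math

def lr_at(step: int, lr: float = 3e-4, warmup_steps: int = 100,
          min_lr: float = 1e-5, total_steps: int = 10_000) -> float:
    """Linear warmup to `lr`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps               # ramp up from ~0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))     # 1 -> 0 over training
    return min_lr + (lr - min_lr) * cosine
```

Note also that `gradient_accumulation_steps` multiplies the effective batch size: `batch_size: 32` with 4 accumulation steps behaves like a batch of 128 per optimizer step.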
A complete from-scratch example:

```yaml
version: 2
name: train_from_scratch
vars:
  d_model: 256
  n_heads: 4
  n_layers: 4

targets:
  - type: experiment
    name: baseline
    trainer: trainer.standard
    runs:
      - id: train
        mode: train
        steps: 5000
        train:
          phase: standard
          batch_size: 16
          block_size: 256
          lr: 0.001
          device: mps
          dtype: float32
          warmup_steps: 200
          lr_schedule: cosine
```

Run it:
```bash
python3 -m caramba my_training.yml
```

## Upcycle Training

Architecture surgery that converts a pretrained model to a new architecture while preserving learned representations. This is caramba's flagship feature for attention surgery research.
```text
┌──────────────────────────────────────────────────────────┐
│                     UPCYCLE PIPELINE                     │
├──────────────────────────────────────────────────────────┤
│  1. Load Teacher   Load pretrained checkpoint            │
│  2. Build Student  Create target architecture            │
│  3. Surgery        SVD-based weight initialization       │
│  4. Blockwise      Layer-by-layer distillation           │
│  5. Global         End-to-end fine-tuning                │
│  6. Verify         Compare teacher/student outputs       │
│  7. Benchmark      Measure quality/speed/memory          │
└──────────────────────────────────────────────────────────┘
```
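Step 3 replaces full-rank teacher projections with the student's bottleneck factors. A minimal sketch of the SVD idea, with hypothetical shapes and names (caramba's actual surgery code is more involved):

```python
import torch

def svd_factor(teacher_w: torch.Tensor, rank: int):
    """Split a (d_out, d_in) teacher matrix into down/up projections of width `rank`.

    The truncated SVD is the best rank-`rank` approximation in the Frobenius
    norm, so the student's bottleneck starts as close to the teacher as possible.
    """
    U, S, Vh = torch.linalg.svd(teacher_w, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    down = sqrt_s[:, None] * Vh[:rank]   # (rank, d_in): compress the input
    up = U[:, :rank] * sqrt_s[None, :]   # (d_out, rank): expand back out
    return down, up

# e.g. seed a 256-dim geometric bottleneck from a 2048x2048 teacher projection
down, up = svd_factor(torch.randn(2048, 2048), rank=256)
```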
The pipeline is configured as two runs:

```yaml
trainer: trainer.upcycle
runs:
  # Phase 1: Blockwise distillation
  - id: blockwise
    mode: train
    steps: 500
    train:
      phase: blockwise
      teacher_ckpt: hf://meta-llama/Llama-3.2-1B
      batch_size: 1
      block_size: 2048
      lr: 0.0001
      device: mps
      dtype: float32

  # Phase 2: Global fine-tuning
  - id: finetune
    mode: train
    steps: 2000
    train:
      phase: global
      batch_size: 1
      block_size: 2048
      lr: 0.00005
      device: mps
      dtype: float32
```

The blockwise phase trains each layer to match teacher outputs:
```yaml
train:
  phase: blockwise
  teacher_ckpt: hf://meta-llama/Llama-3.2-1B

  # Convergence-based training (optional)
  convergence_target: 0.02      # Target L1 loss
  convergence_patience: 100     # Steps without improvement
  convergence_max_steps: 2000   # Max steps per block

  # Teacher output caching
  cache_teacher_outputs: true   # Cache for speed
```
By default, blockwise uses a fixed LR and (optionally) convergence-based early stopping. If you want it to self-tune and surface those decisions in console logs, enable the lightweight blockwise autotuner:

```yaml
train:
  phase: blockwise
  blockwise_autotune_enabled: true
  blockwise_autotune_mode: monitor   # monitor|active

  # Tuning knobs (optional)
  blockwise_autotune_plateau_patience: 100
  blockwise_autotune_lr_decay: 0.5
  blockwise_autotune_min_lr: 1e-6
  blockwise_autotune_log_every: 50
```

- `monitor`: logs spikes/plateaus and what it would do (no LR changes)
- `active`: reduces LR on spikes/plateaus within each block (see the sketch below)
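A minimal sketch of the kind of rule `active` mode applies; the names and the spike test here are illustrative, not the autotuner's actual internals:

```python
def adjust_lr(lr: float, loss: float, best_loss: float, steps_since_best: int,
              plateau_patience: int = 100, lr_decay: float = 0.5,
              min_lr: float = 1e-6, spike_factor: float = 2.0) -> float:
    """Decay LR on a loss spike or plateau, never below min_lr."""
    spiked = loss > spike_factor * best_loss          # sudden loss blow-up
    plateaued = steps_since_best >= plateau_patience  # no recent improvement
    return max(min_lr, lr * lr_decay) if (spiked or plateaued) else lr
```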
The blockwise phase:
- Iterates through each transformer block
- Runs teacher forward to get target outputs
- Trains student block to minimize L1 distance (see the sketch after this list)
- Optionally uses convergence-based stopping
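A hedged sketch of that per-block loop, including the convergence knobs from above; module and buffer names are illustrative:

```python
import torch
import torch.nn.functional as F

def distill_block(student_block, batches, *, lr=1e-4, target=0.02,
                  patience=100, max_steps=2000):
    """Train one student block to match cached (input, teacher_output) pairs."""
    opt = torch.optim.AdamW(student_block.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for step in range(max_steps):
        x, teacher_y = batches[step % len(batches)]
        loss = F.l1_loss(student_block(x), teacher_y)   # L1 distance to teacher
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < best:
            best, stale = loss.item(), 0
        else:
            stale += 1
        if best <= target or stale >= patience:         # convergence-based stop
            break
    return best
```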
The global phase fine-tunes the entire model end-to-end:
```yaml
train:
  phase: global
  lr: 0.00005   # Lower LR for fine-tuning
```

The global phase:
- Unfreezes all parameters
- Trains on next-token prediction
- Uses cross-entropy loss (a minimal sketch follows)
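Concretely, that loss is the usual shifted next-token cross-entropy; a minimal sketch:

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """logits: (B, T, V) model outputs; tokens: (B, T) input ids."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at positions 0..T-2
        tokens[:, 1:].reshape(-1),                    # targets are the following tokens
    )
```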
Converting Llama 3.2 1B to Decoupled Bottleneck Attention:
```yaml
version: 2
name: llama_to_dba

vars:
  d_model: 2048
  n_heads: 32
  n_kv_heads: 8
  sem_dim: 128    # Semantic bottleneck
  geo_dim: 256    # Geometric bottleneck

targets:
  - type: experiment
    name: upcycle
    trainer: trainer.upcycle
    system:
      ref: system.language_model
      config:
        model:
          type: TransformerModel
          topology:
            type: StackedTopology
            layers:
              - type: NestedTopology
                repeat: 16
                layers:
                  - type: ResidualTopology
                    layers:
                      - type: RMSNormLayer
                        d_model: ${d_model}
                      - type: AttentionLayer
                        d_model: ${d_model}
                        n_heads: ${n_heads}
                        n_kv_heads: ${n_kv_heads}
                        mode: decoupled    # DBA mode
                        sem_dim: ${sem_dim}
                        geo_dim: ${geo_dim}
                        rope_enabled: true
                  # ... FFN blocks ...
    runs:
      - id: blockwise
        train:
          phase: blockwise
          teacher_ckpt: hf://meta-llama/Llama-3.2-1B
          convergence_target: 0.02
      - id: finetune
        train:
          phase: global
          lr: 0.00005
```

Attach verification to check quality after training:
```yaml
runs:
  - id: blockwise
    verify:
      type: compare
      batches: 5
      attention:
        max_mean_l1: 0.05
        max_max_l1: 0.25
      logits:
        max_mean_l1: 0.05
        max_max_l1: 0.25
```
max_max_l1: 0.25See Manifests → Verification for all options.
## Orchestrated Training

Dynamic optimizer switching based on training telemetry. The orchestrator monitors loss, gradients, and training phase to select the best optimization strategy.
Different training phases benefit from different strategies:
| Phase | Challenge | Strategy |
|---|---|---|
| Early | High gradients | Conservative clipping |
| Plateau | Slow progress | Momentum boost |
| Late | Overfitting | SGD for generalization |
| Spike | Loss explosion | Safety rollback |
```yaml
train:
  phase: global
  orchestrator_enabled: true
  orchestrator_decision_interval: 500
  orchestrator_eval_horizon: 100
  orchestrator_initial_strategy: conservative_adamw
  orchestrator_use_adagc: true
```

All orchestrator options:

```yaml
train:
  # Enable orchestration
  orchestrator_enabled: true

  # Decision timing
  orchestrator_decision_interval: 500   # Steps between decisions
  orchestrator_eval_horizon: 100        # Steps to evaluate each strategy

  # Initial strategy
  orchestrator_initial_strategy: conservative_adamw

  # Strategy components
  orchestrator_use_adagc: true          # Adaptive gradient clipping
  orchestrator_use_nowcasting: false    # Weight trajectory prediction

  # Safety
  orchestrator_max_loss_increase: 1.5   # Rollback threshold
  orchestrator_safety_strategy: spike_resistant
```

Available strategies:

| Strategy | Description |
|---|---|
| `conservative_adamw` | Safe defaults, moderate LR, global clipping |
| `aggressive_adamw` | Higher LR, less clipping, faster convergence |
| `sgd_escape` | SGD with momentum for escaping sharp minima |
| `spike_resistant` | Low LR, aggressive clipping for unstable phases |
The orchestrator includes SWATS, which automatically switches from Adam to SGD when training stabilizes:
```python
# Programmatic usage
from caramba.orchestrator import SWATS, SWATSConfig

optimizer = SWATS(
    model.parameters(),
    config=SWATSConfig(
        adam_lr=1e-3,
        switch_threshold=1e-9,
        min_steps_before_switch=1000,
    ),
)
```

AdaGC applies per-parameter clipping that adapts to each parameter's gradient distribution:
```yaml
train:
  orchestrator_use_adagc: true
  orchestrator_adagc_warmup: 100
  orchestrator_adagc_threshold: 3.0
```
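The idea behind adaptive per-parameter clipping, as a rough sketch. This is not caramba's AdaGC code; the EMA rule and epsilon are illustrative:

```python
import torch

def adagc_clip(params, ema, step, *, warmup=100, threshold=3.0, beta=0.99):
    """Clip each parameter's grad to threshold x its own running grad-norm."""
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        norm = p.grad.norm().item()
        if step >= warmup and norm > threshold * ema[i]:
            p.grad.mul_(threshold * ema[i] / (norm + 1e-12))  # rescale outlier grads
        ema[i] = beta * ema[i] + (1 - beta) * norm            # update running norm

# usage: ema = [0.0] * len(list(model.parameters())), call once per step
```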
How the pieces fit together:

```text
┌────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR                        │
├────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Telemetry   │───▶│   Decision   │───▶│   Strategy   │  │
│  │    Stream    │    │   Boundary   │    │    Switch    │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                   │                   │          │
│         ▼                   ▼                   ▼          │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │    Spike     │    │     UCB      │    │ Speculative  │  │
│  │   Detector   │    │    Bandit    │    │  Branching   │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
└────────────────────────────────────────────────────────────┘
```
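At each decision boundary, a bandit of this flavor balances strategies that have paid off against ones it has barely tried. A toy UCB1 sketch over the strategy names above; the real orchestrator's reward signal and bookkeeping differ:

```python
import math

def pick_strategy(history: dict[str, list[float]], c: float = 1.4) -> str:
    """UCB1: mean reward plus an exploration bonus for rarely-tried arms."""
    total = sum(len(v) for v in history.values()) or 1
    def ucb(name: str) -> float:
        rewards = history[name]
        if not rewards:
            return float("inf")   # always try an untested strategy first
        return sum(rewards) / len(rewards) + c * math.sqrt(math.log(total) / len(rewards))
    return max(history, key=ucb)

history = {"conservative_adamw": [0.8, 0.7], "aggressive_adamw": [0.9],
           "sgd_escape": [], "spike_resistant": [0.4]}
print(pick_strategy(history))   # -> "sgd_escape" (untried, so explored first)
```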
## Training Configuration

The full set of `train` options:

```yaml
train:
  # === Core Settings ===
  phase: standard             # standard, blockwise, global
  batch_size: 32
  block_size: 512
  lr: 0.0003
  device: mps                 # mps, cuda, cpu
  dtype: float32              # float32, float16, bfloat16

  # === Upcycle Settings ===
  teacher_ckpt: null          # HF path or local checkpoint
  cache_teacher_outputs: false

  # === Convergence Settings ===
  convergence_target: null    # Target loss for early stopping
  convergence_patience: 100   # Steps without improvement
  convergence_max_steps: null

  # === Optimizer Settings ===
  weight_decay: 0.01
  beta1: 0.9
  beta2: 0.95
  grad_clip: 1.0

  # === LR Schedule ===
  warmup_steps: 0
  lr_schedule: cosine         # cosine, linear, constant, none
  min_lr: 0.0

  # === Mixed Precision ===
  use_amp: false
  amp_dtype: float16

  # === Optimization ===
  compile_model: false
  gradient_accumulation_steps: 1
  activation_checkpointing: false
  activation_checkpoint_threshold: null

  # === Data Loading ===
  num_workers: 0
  pin_memory: false

  # === Orchestrator ===
  orchestrator_enabled: false
  orchestrator_decision_interval: 500
  orchestrator_eval_horizon: 100
  orchestrator_initial_strategy: conservative_adamw
  orchestrator_use_adagc: false
  orchestrator_use_nowcasting: false
```

Device selection:

| Device | When to Use |
|---|---|
| `mps` | Apple Silicon (M1/M2/M3/M4) |
| `cuda` | NVIDIA GPUs |
| `cpu` | Development/testing only |
| dtype | Precision | Memory | Speed | Use Case |
|---|---|---|---|---|
| `float32` | Full | High | Baseline | Training stability |
| `float16` | Half | Low | Fast | Inference, AMP |
| `bfloat16` | Brain float | Low | Fast | Training on Ampere+ |
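For reference, `use_amp: true` with `amp_dtype: float16` corresponds to the standard PyTorch autocast-plus-GradScaler pattern sketched below. The trainer handles this wiring internally; the sketch only clarifies the settings (bfloat16 typically needs no scaler):

```python
import torch
from torch import nn

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling guards against fp16 underflow

for _ in range(10):
    x = torch.randn(32, 16, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # forward pass runs in half precision
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales, skips step on inf/nan grads
    scaler.update()
```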
## Distributed Training

Scale training to multiple GPUs with DDP or FSDP.
Use DDP for models that fit on a single GPU:
```python
from caramba.trainer import DistributedConfig, DistributedStrategy

dist_config = DistributedConfig(
    strategy=DistributedStrategy.DDP,
    ddp_find_unused_parameters=False,
)
```

Launch:
```bash
torchrun --nproc_per_node=4 train.py
```

Use FSDP for models that don't fit on a single GPU:
```python
dist_config = DistributedConfig(
    strategy=DistributedStrategy.FSDP,
    fsdp_sharding_strategy="FULL_SHARD",
    fsdp_mixed_precision=True,
    fsdp_activation_checkpointing=True,
    fsdp_transformer_layer_cls=["TransformerBlock"],
)
```

Rank-aware helpers are available for multi-process code:

```python
from caramba.trainer.distributed import (
    is_distributed,
    get_rank,
    get_world_size,
    is_main_process,
)

if is_main_process():
    print(f"Training on {get_world_size()} GPUs")
```

Use the `quick` target for fast iteration:
```bash
python3 -m caramba config/presets/llama32_1b_dba.yml --target quick
```

This runs with:
- Reduced steps (50 blockwise, 100 global)
- Smaller block size (512)
- Minimal benchmarks
Use the `paper` target for publication-quality experiments:
```bash
python3 -m caramba config/presets/llama32_1b_dba.yml --target paper
```

This runs with:
- Full training (500 blockwise, 2000 global)
- Full block size (2048)
- Complete benchmarks (perplexity, latency, memory)
- Artifact generation (CSV, PNG, LaTeX)
In summary:

| Mode | Trainer | Phases | Use Case |
|---|---|---|---|
| Standard | `trainer.standard` | standard | From-scratch training |
| Upcycle | `trainer.upcycle` | blockwise → global | Architecture surgery |
| Orchestrated | Any + flags | Any | Adaptive optimization |