Skip to content

setup-1024seq-l40s-scale-run #38

Description

@david-thrower

Training Plan: Max out what is feasible on an L40S GPU (Seq len 1024, 79M param model)

HelixLM d1024/n2/ffn2.7/nl4/s1024

Prerequisites

  • Validate the handling of out - of - memory data with pre-chunking and pre-tokenization...

Final Configuration (Optimized)

Parameter Value Rationale
d_model 1024 High dim > many columns (upstream Cerebros insight)
ffn_expansion 2.7 Per PanGu-π
n_columns 2 Simple graph, faster training
n_loops 4 Recurrent depth (seq_len scale)
seq_len 1024 Target context length
n_heads 16 1024/64 per head
vocab_size 50,257 GPT-2 (proven efficient)

Parameters:

  • Active: 27.5M (1.98× reference)
  • Total: ~79M
  • TPA @ 3B: 109 (matches reference saturation)

Token Budget (3B Pretrain)

Phase Tokens TPA Notes
Pretraining 3.0B - 4.0B 109 Chinchilla++ efficient
Instruct tuning 120M - 400M - 4% of pretrain
Remedial reserve 600M - For gap-filling
- - Trained over ~20 hours

Training Config

D_MODEL = 1024
N_HEADS = 16
FFN_EXPANSION = 2.7
SEQ_LEN = 1024
N_LOOPS = 4
N_COLUMNS = 2
VOCAB_SIZE = 50257

# Topology
LATERAL_P = 0.8
VERTICAL_P = 0.9
VERTICAL_DEPTH = 2

# Regularization
DROPOUT = 0.15 # Maybe 0.1 0.08 in pretrain. Certainly 0.05 - 0.1  in instruct fine tuning.
WEIGHT_DECAY = 0.05

# Training (memory-optimized)
BATCH_SIZE = 16
GRAD_ACCUM = 4  # effective 64
USE_AMP = True
AMP_DTYPE = "bfloat16"

# Learning Rate Stages
LR_STAGES = [2e-3, 1e-3, 3e-4]
WARMUP_STAGES = [100, 10, 10]

Phase-by-Phase Punchlist

Phase 1: Data Preparation

  • Curate 3B+ token English corpus (99% fienwebedu, 1% openweb math)
  • 98/8 train/val split

Phase 2: Model Setup

cfg = HelixConfig(
    d_model=1024,
    n_heads=16,
    n_columns=2,
    n_loops=4,
    seq_len=1024,
    ffn_expansion=2.7,
    lateral_p=0.8,
    vertical_p=0.9,
    vertical_depth=2,
    use_rope=True,
    dropout=0.15,
)
  • Verify graph: 2 columns × 3 nodes = ~6 total nodes
  • Test forward: batch=16, seq=1024
  • VRAM check: should fit in ~40GB (48GB available)

Phase 3: Training Execution

  • Stage 1: lr=2e-3, ~20% tokens
  • Stage 2: lr=1e-3, ~60% tokens
  • Stage 3: lr=3e-4, ~20% tokens
  • Track tokens/sec, eta, loss
  • Monitor gradient norms

Phase 4: Saturation Validation

  • Weight Watcher spectral analysis
  • Target: >90% layers saturated
  • If under: use 600M remedial reserve

Phase 5: Instruct Tuning

  • 120M conversation tokens
  • LR: 5e-5 constant
  • Early stop on convergence

Success Criteria

Metric Target
Active params 27.5M (2× ref)
TPA 109 @ 3B tokens
Train tokens 3.0B ± 10%
Hours < 24
Val PPL < ref × 1.2
Saturation > 90%

Scale-Up Path

If successful (verified saturation):

  • seq_len: 1024 → 2048 → 3096
  • n_loops: 4 → 5 (if time permits)
  • Next model: d1024, n_columns=3 (~41M active)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions