Training Plan: Max out what is feasible on an L40S GPU (Seq len 1024, 79M param model)
HelixLM d1024/n2/ffn2.7/nl4/s1024
Prerequisites
- Validate the handling of out-of-memory data with pre-chunking and pre-tokenization...
Final Configuration (Optimized)
| Parameter | Value | Rationale |
| --- | --- | --- |
| d_model | 1024 | High dim > many columns (upstream Cerebros insight) |
| ffn_expansion | 2.7 | Per PanGu-π |
| n_columns | 2 | Simple graph, faster training |
| n_loops | 4 | Recurrent depth (seq_len scale) |
| seq_len | 1024 | Target context length |
| n_heads | 16 | 1024/64 per head |
| vocab_size | 50,257 | GPT-2 (proven efficient) |
Parameters:
- Active: 27.5M (1.98× reference)
- Total: ~79M
- TPA @ 3B: 109 (matches reference saturation)
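The parameter figures above can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that the plan does not state explicitly: that TPA means pretraining tokens divided by active parameters, and that the gap between active and total parameters is the token-embedding table (vocab_size × d_model).

```python
# Figures taken from the plan above.
ACTIVE_PARAMS = 27.5e6
TOTAL_PARAMS = 79e6          # stated as approximate
PRETRAIN_TOKENS = 3.0e9
D_MODEL = 1024
VOCAB_SIZE = 50_257

# Assumption: TPA = pretraining tokens / active parameters.
tpa = PRETRAIN_TOKENS / ACTIVE_PARAMS

# Assumption: the non-active remainder is the token-embedding table.
embedding_params = VOCAB_SIZE * D_MODEL
estimated_total = ACTIVE_PARAMS + embedding_params

print(f"TPA at 3B tokens:    {tpa:.1f}")
print(f"Embedding table:     {embedding_params / 1e6:.2f}M")
print(f"Active + embeddings: {estimated_total / 1e6:.2f}M")
```

Under these assumptions the numbers line up: about 109 tokens per active parameter, and roughly 79M parameters in total once the embedding table is added.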
Token Budget (3B Pretrain)
| Phase | Tokens | TPA | Notes |
| --- | --- | --- | --- |
| Pretraining | 3.0B - 4.0B | 109 | Chinchilla++ efficient |
| Instruct tuning | 120M - 400M | - | 4% of pretrain |
| Remedial reserve | 600M | - | For gap-filling |
| - | - | Trained over ~20 hours | |
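To judge whether the budget fits the time window, it helps to turn it into a required average throughput. The sketch below assumes the "~20 hours" applies to the pretraining phase alone. The plan does not say whether instruct tuning and the remedial reserve are included, so treat the result as a rough lower bound.

```python
HOURS = 20  # "~20 hours" from the table above


def required_tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average throughput needed to consume total_tokens within hours."""
    return total_tokens / (hours * 3600)


for tokens in (3.0e9, 4.0e9):
    rate = required_tokens_per_second(tokens, HOURS)
    print(f"{tokens / 1e9:.1f}B tokens in {HOURS} h -> {rate:,.0f} tokens/s")
```

Comparing a short benchmark run against these rates gives an early signal of whether the upper end of the pretraining range is reachable.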
Training Config
D_MODEL = 1024
N_HEADS = 16
FFN_EXPANSION = 2.7
SEQ_LEN = 1024
N_LOOPS = 4
N_COLUMNS = 2
VOCAB_SIZE = 50257
# Topology
LATERAL_P = 0.8
VERTICAL_P = 0.9
VERTICAL_DEPTH = 2
# Regularization
DROPOUT = 0.15 # Consider 0.08-0.1 for pretraining; use 0.05-0.1 for instruct fine-tuning.
WEIGHT_DECAY = 0.05
# Training (memory-optimized)
BATCH_SIZE = 16
GRAD_ACCUM = 4 # effective batch size: 16 x 4 = 64 sequences
USE_AMP = True
AMP_DTYPE = "bfloat16"
# Learning Rate Stages
LR_STAGES = [2e-3, 1e-3, 3e-4]
WARMUP_STAGES = [100, 10, 10]
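The batch settings and learning-rate stages above imply a step budget and a schedule shape. The sketch below shows one possible reading and is not the project's actual scheduler: it assumes each stage uses a linear warmup to its peak rate followed by a constant rate, and that the run is split into equal-length stages. `stage_lr`, `lr_at`, and `steps_per_stage` are hypothetical names introduced for illustration.

```python
BATCH_SIZE = 16
GRAD_ACCUM = 4
SEQ_LEN = 1024
LR_STAGES = [2e-3, 1e-3, 3e-4]
WARMUP_STAGES = [100, 10, 10]

# Tokens consumed per optimizer step, and steps needed for 3B tokens.
tokens_per_step = BATCH_SIZE * GRAD_ACCUM * SEQ_LEN
total_steps = int(3.0e9 // tokens_per_step)

# Assumption: stages are of equal length.
steps_per_stage = total_steps // len(LR_STAGES)


def stage_lr(step_in_stage: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then constant (assumed shape)."""
    if step_in_stage < warmup_steps:
        return peak_lr * (step_in_stage + 1) / warmup_steps
    return peak_lr


def lr_at(step: int) -> float:
    """Learning rate at a global optimizer step under the assumptions above."""
    stage = min(step // steps_per_stage, len(LR_STAGES) - 1)
    step_in_stage = step - stage * steps_per_stage
    return stage_lr(step_in_stage, LR_STAGES[stage], WARMUP_STAGES[stage])


print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"optimizer steps for 3B tokens: {total_steps:,}")
for s in (0, 99, steps_per_stage, total_steps - 1):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

If the real stage boundaries are tied to validation loss or token counts rather than equal splits, only `steps_per_stage` and the stage lookup in `lr_at` need to change.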
Phase-by-Phase Punchlist
Phase 1: Data Preparation
Phase 2: Model Setup
cfg = HelixConfig(
    d_model=1024,
    n_heads=16,
    n_columns=2,
    n_loops=4,
    seq_len=1024,
    ffn_expansion=2.7,
    lateral_p=0.8,
    vertical_p=0.9,
    vertical_depth=2,
    use_rope=True,
    dropout=0.15,
)
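Before building the model, a few shape checks on the configuration values can catch mistakes early. The sketch below uses plain Python and does not touch `HelixConfig`. The rounding of the FFN width to a multiple of 64 is an assumption made for illustration, since the plan does not say how HelixLM turns a fractional `ffn_expansion` into an integer width.

```python
import math

d_model, n_heads, ffn_expansion = 1024, 16, 2.7

# d_model must split evenly across attention heads.
assert d_model % n_heads == 0
head_dim = d_model // n_heads

# Standard RoPE rotates pairs of dimensions, so head_dim should be even.
assert head_dim % 2 == 0


def round_up(x: float, multiple: int) -> int:
    """Round x up to the nearest multiple (hypothetical helper)."""
    return int(math.ceil(x / multiple) * multiple)


# 1024 * 2.7 is not an integer, so some rounding has to happen.
ffn_hidden_raw = d_model * ffn_expansion
ffn_hidden = round_up(ffn_hidden_raw, 64)  # assumed rounding rule

print(f"head_dim = {head_dim}")
print(f"raw FFN width = {ffn_hidden_raw:.1f}, rounded = {ffn_hidden}")
```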
Phase 3: Training Execution
Phase 4: Saturation Validation
Phase 5: Instruct Tuning
Success Criteria
| Metric | Target |
| --- | --- |
| Active params | 27.5M (2× ref) |
| TPA | 109 @ 3B tokens |
| Train tokens | 3.0B ± 10% |
| Hours | < 24 |
| Val PPL | < ref × 1.2 |
| Saturation | > 90% |
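The run-dependent targets in this table can be encoded as a simple pass/fail check. This is a minimal sketch: `check_success` and the metric keys are hypothetical names, saturation is assumed to be reported as a fraction between 0 and 1, and the example values are placeholders rather than measurements.

```python
def check_success(metrics: dict, ref_ppl: float) -> dict:
    """Compare run metrics against the targets in the table above."""
    target_tokens = 3.0e9
    return {
        "train_tokens": abs(metrics["train_tokens"] - target_tokens) <= 0.10 * target_tokens,
        "hours": metrics["hours"] < 24,
        "val_ppl": metrics["val_ppl"] < 1.2 * ref_ppl,
        "saturation": metrics["saturation"] > 0.90,
    }


# Placeholder values for illustration only, not real results.
example = {"train_tokens": 3.1e9, "hours": 21.5, "val_ppl": 23.0, "saturation": 0.93}
results = check_success(example, ref_ppl=20.0)
for name, passed in results.items():
    print(f"{name:>12}: {'pass' if passed else 'FAIL'}")
```

The active-parameter and TPA targets follow directly from the configuration, so they are left out of this runtime check.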
Scale-Up Path
If successful (verified saturation):
- seq_len: 1024 → 2048 → 3096
- n_loops: 4 → 5 (if time permits)
- Next model: d1024, n_columns=3 (~41M active)
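The ~41M figure for the three-column model matches a simple linear extrapolation from the two-column model. The sketch below assumes active parameters scale linearly with `n_columns`, which the plan implies but does not state, and shows what that does to the token budget if the same TPA is kept.

```python
ACTIVE_PARAMS_2COL = 27.5e6   # from the Parameters section
TARGET_TPA = 109

# Assumption: active parameters scale linearly with n_columns.
per_column = ACTIVE_PARAMS_2COL / 2
active_3col = per_column * 3

# TPA if the 3B-token budget is kept, and tokens needed to hold TPA at 109.
tpa_at_3b = 3.0e9 / active_3col
tokens_for_target_tpa = TARGET_TPA * active_3col

print(f"3-column active params: {active_3col / 1e6:.2f}M")
print(f"TPA at 3B tokens:       {tpa_at_3b:.1f}")
print(f"Tokens for TPA {TARGET_TPA}:     {tokens_for_target_tpa / 1e9:.2f}B")
```

Under this assumption, keeping the same saturation target for the next model needs about 1.5× the pretraining tokens.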