setup-1024seq-l40s-scale-run



# Training Plan: Max out what is feasible on an L40S GPU (Seq len 1024, 79M param model)

 HelixLM d1024/n2/ffn2.7/nl4/s1024

## Prerequisites

- Validate the handling of out - of - memory data with pre-chunking and pre-tokenization... 

## Final Configuration (Optimized)

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **d_model** | 1024 | High dim > many columns (upstream Cerebros insight) |
| **ffn_expansion** | 2.7 | Per PanGu-π |
| **n_columns** | 2 | Simple graph, faster training |
| **n_loops** | 4 | Recurrent depth (seq_len scale) |
| **seq_len** | 1024 | Target context length |
| **n_heads** | 16 | 1024/64 per head |
| **vocab_size** | 50,257 | GPT-2 (proven efficient) |

**Parameters:**
- Active: 27.5M (1.98× reference)
- Total: ~79M
- TPA @ 3B: 109 (matches reference saturation)

---

## Token Budget (3B Pretrain)

| Phase | Tokens | TPA | Notes |
|-------|--------|-----|-------|
| Pretraining | **3.0B - 4.0B** | 109 | Chinchilla++ efficient |
| Instruct tuning | 120M - 400M  | - | 4% of pretrain |
| Remedial reserve | 600M | - | For gap-filling |
| - | -  | Trained over ~20 hours |

---

## Training Config

```python
D_MODEL = 1024
N_HEADS = 16
FFN_EXPANSION = 2.7
SEQ_LEN = 1024
N_LOOPS = 4
N_COLUMNS = 2
VOCAB_SIZE = 50257

# Topology
LATERAL_P = 0.8
VERTICAL_P = 0.9
VERTICAL_DEPTH = 2

# Regularization
DROPOUT = 0.15 # Maybe 0.1 0.08 in pretrain. Certainly 0.05 - 0.1  in instruct fine tuning.
WEIGHT_DECAY = 0.05

# Training (memory-optimized)
BATCH_SIZE = 16
GRAD_ACCUM = 4  # effective 64
USE_AMP = True
AMP_DTYPE = "bfloat16"

# Learning Rate Stages
LR_STAGES = [2e-3, 1e-3, 3e-4]
WARMUP_STAGES = [100, 10, 10]
```

---

## Phase-by-Phase Punchlist

### Phase 1: Data Preparation
- [ ] Curate 3B+ token English corpus (99% fienwebedu, 1% openweb math)
- [ ] 98/8 train/val split

### Phase 2: Model Setup
```python
cfg = HelixConfig(
    d_model=1024,
    n_heads=16,
    n_columns=2,
    n_loops=4,
    seq_len=1024,
    ffn_expansion=2.7,
    lateral_p=0.8,
    vertical_p=0.9,
    vertical_depth=2,
    use_rope=True,
    dropout=0.15,
)
```
- [ ] Verify graph: 2 columns × 3 nodes = ~6 total nodes
- [ ] Test forward: batch=16, seq=1024
- [ ] VRAM check: should fit in ~40GB (48GB available)

### Phase 3: Training Execution
- [ ] Stage 1: lr=2e-3, ~20% tokens
- [ ] Stage 2: lr=1e-3, ~60% tokens  
- [ ] Stage 3: lr=3e-4, ~20% tokens
- [ ] Track tokens/sec, eta, loss
- [ ] Monitor gradient norms

### Phase 4: Saturation Validation
- [ ] Weight Watcher spectral analysis
- [ ] Target: >90% layers saturated
- [ ] If under: use 600M remedial reserve

### Phase 5: Instruct Tuning
- [ ] 120M conversation tokens
- [ ] LR: 5e-5 constant
- [ ] Early stop on convergence

---

## Success Criteria

| Metric | Target |
|--------|--------|
| Active params | 27.5M (2× ref) |
| TPA | 109 @ 3B tokens |
| Train tokens | 3.0B ± 10% |
| Hours | < 24 |
| Val PPL | < ref × 1.2 |
| Saturation | > 90% |

---

## Scale-Up Path

If successful (verified saturation):
- seq_len: 1024 → 2048 → 3096
- n_loops: 4 → 5 (if time permits)
- Next model: d1024, n_columns=3 (~41M active)

---


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

setup-1024seq-l40s-scale-run #38

Training Plan: Max out what is feasible on an L40S GPU (Seq len 1024, 79M param model)

Prerequisites

Final Configuration (Optimized)

Token Budget (3B Pretrain)

Training Config

Phase-by-Phase Punchlist

Phase 1: Data Preparation

Phase 2: Model Setup

Phase 3: Training Execution

Phase 4: Saturation Validation

Phase 5: Instruct Tuning

Success Criteria

Scale-Up Path

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Parameter	Value	Rationale
d_model	1024	High dim > many columns (upstream Cerebros insight)
ffn_expansion	2.7	Per PanGu-π
n_columns	2	Simple graph, faster training
n_loops	4	Recurrent depth (seq_len scale)
seq_len	1024	Target context length
n_heads	16	1024/64 per head
vocab_size	50,257	GPT-2 (proven efficient)

Phase	Tokens	TPA	Notes
Pretraining	3.0B - 4.0B	109	Chinchilla++ efficient
Instruct tuning	120M - 400M	-	4% of pretrain
Remedial reserve	600M	-	For gap-filling
-	-	Trained over ~20 hours

Metric	Target
Active params	27.5M (2× ref)
TPA	109 @ 3B tokens
Train tokens	3.0B ± 10%
Hours	< 24
Val PPL	< ref × 1.2
Saturation	> 90%

setup-1024seq-l40s-scale-run #38

Description

Training Plan: Max out what is feasible on an L40S GPU (Seq len 1024, 79M param model)

Prerequisites

Final Configuration (Optimized)

Token Budget (3B Pretrain)

Training Config

Phase-by-Phase Punchlist

Phase 1: Data Preparation

Phase 2: Model Setup

Phase 3: Training Execution

Phase 4: Saturation Validation

Phase 5: Instruct Tuning

Success Criteria

Scale-Up Path

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions