Title: Streaming Dataset Performance Bottleneck and Progress Bar Issues
Description
The API enhancements for handling streaming datasets implemented in #34 have
two critical issues affecting training time and monitoring accuracy.
Issue 1: CPU Bottleneck in Preprocessing
Expected vs. Actual Performance:
• Config: 512 seq len, d_model 512, 1.5B tokens, 3 recurrent loops
• Throughput: 29,000 tokens/sec (healthy for 3 loops)
• Expected completion: 9-12 hours
• Actual: >18 hours with no epoch completion
Performance Calculation:
Factor                                    Impact          Cumulative
4x data (1.5B vs. 400M token baseline)    4x train time
4x batching time                          +4x overhead
1/4 steps                                 -4x time        net 0 change
2x grad accumulation, 0.5x batch size     net 0           net 0 change
1.5x recurrence                           1.5x time       net 1.5x change
Linear attention (4x seq_len benefit)     ~4x time        net 4.5x change
Net expected                              ~4.5x the 2.5-hour baseline, ~9-12 hours
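As a cross-check on the estimate above, the sketch below compares the table's figure against the time one pass over the data would take at the reported throughput. It assumes the tok/s counter counts each dataset token once and stays steady over the epoch. Neither assumption is confirmed here, and the helper name epoch_hours is made up for illustration.

```python
def epoch_hours(total_tokens: float, tokens_per_sec: float) -> float:
    """Hours for one pass over the data at a constant token throughput."""
    return total_tokens / tokens_per_sec / 3600.0


# Figures taken from this issue: 1.5B tokens, ~29,000 tokens/sec.
hours_at_reported_rate = epoch_hours(1.5e9, 29_000)

# Estimate from the table: ~4.5x the 2.5-hour baseline.
hours_from_table = 4.5 * 2.5

print(f"one epoch at reported throughput: {hours_at_reported_rate:.2f} h")  # 14.37 h
print(f"estimate from factor table:       {hours_from_table:.2f} h")        # 11.25 h
```

If those assumptions hold, 29,000 tokens/sec alone implies roughly 14.4 hours per epoch, which is already above the 9-12 hour estimate but still below the 18+ hours observed.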
Suspected Cause: CPU-bound batched preprocessing (tokenization) is leaving the
GPU idle. The previous 400M-token run with 2 loops completed in ~2.5 hours
using List[str] instead of an iterable dataset.
Questions:
1. Is the bottleneck diagnosis correct, or is there a math error in the estimate?
2. If the bottleneck is real, what are the options that satisfy both:
• Streaming (don't load everything in memory)
• No stalls while waiting on batch preprocessing
Potential Solutions:
• Pre-tokenize and cache to disk (allows counting steps upfront)
• Stream on the fly, with the preprocessing bottleneck fixed
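One way the second option might look is a bounded background prefetcher, so tokenization of upcoming batches overlaps with the training step instead of blocking it. This is a minimal sketch using only the standard library. The names prefetch_tokenized, batched, tokenize, and max_prefetch are hypothetical and are not taken from the code in #34. A thread only helps if the tokenizer releases the GIL while it works; otherwise worker processes would be needed instead.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator, List

_DONE = object()  # marks the end of the stream


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a (possibly unbounded) stream of strings into lists."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def prefetch_tokenized(
    texts: Iterable[str],
    tokenize: Callable[[List[str]], Any],  # hypothetical batch tokenizer
    batch_size: int = 32,
    max_prefetch: int = 8,
) -> Iterator[Any]:
    """Yield tokenized batches while a worker thread tokenizes ahead.

    The bounded queue keeps at most `max_prefetch` batches in memory,
    so the dataset is still streamed rather than loaded up front.
    """
    q: "queue.Queue[Any]" = queue.Queue(maxsize=max_prefetch)

    def worker() -> None:
        try:
            for batch in batched(texts, batch_size):
                q.put(tokenize(batch))  # blocks when the queue is full
        except BaseException as exc:  # forward errors to the consumer
            q.put(exc)
        else:
            q.put(_DONE)

    # Daemon thread: it will not keep the process alive if the
    # consumer stops iterating early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, BaseException):
            raise item
        yield item


# Usage with a toy tokenizer (character codes) standing in for the real one.
toy_tokenize = lambda batch: [[ord(c) for c in s] for s in batch]
for tokens in prefetch_tokenized(iter(["ab", "c", "d"]), toy_tokenize, batch_size=2):
    print(tokens)
```

The queue bound is the main design choice: a larger max_prefetch hides more tokenizer jitter at the cost of memory, and it cannot help if tokenization is slower on average than the training step.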
Issue 2: Progress Bar Regression
Current Behavior (Streaming):
Epoch 1: 256527 batch [18:13:53, 3.91batch/s, loss=5.1309, ppl=169.16,
lr=1.00e-03, tok/s=29,007]
• Missing percent completion
• Batch number shown without total batches
• An estimated batch count exists, but TQDM is disabled
Expected Behavior (List[str]):
Epoch 25: 100%|██████████| 101/101 [01:54<00:00, 1.13s/batch, loss=2.4469,
ppl=11.55, lr=2.63e-04, tok/s=679]
• Shows percent completion
• Shows current/total batches
• Full progress bar
Root Cause: Streaming datasets were set up to use estimated steps, and TQDM
was disabled. The corrected logic to count batches first exists but may not be
properly integrated.
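A counting pass of the kind described could be sketched as follows. The function name count_batches and the make_stream factory are hypothetical and do not reflect the existing implementation. A factory is used because the counting pass consumes the stream, so training needs a fresh iterator afterwards.

```python
import math
from typing import Callable, Iterable


def count_batches(
    make_stream: Callable[[], Iterable[str]],  # returns a fresh iterator per call
    batch_size: int,
    drop_last: bool = False,
) -> int:
    """Count batches in one streaming pass without holding samples in memory."""
    n_samples = sum(1 for _ in make_stream())
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


# Example: 10 samples in batches of 4 gives 3 batches (the last one partial).
n_batches = count_batches(lambda: (str(i) for i in range(10)), batch_size=4)
print(n_batches)  # 3
```

Passing the result as total= when constructing the tqdm bar should restore the percentage, the current/total count, and the ETA. The trade-off is one extra read of the dataset before training starts, which is cheaper if the count is cached alongside pre-tokenized data.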