accelerate-streaming-dataset-loading #35

@david-thrower

Title: Streaming Dataset Performance Bottleneck and Progress Bar Issues

Description

The API enhancements for handling streaming datasets implemented in #34 have
two critical issues affecting training time and monitoring accuracy.

Issue 1: CPU Bottleneck in Preprocessing

Expected vs. Actual Performance:

• Config: 512 seq len, d_model 512, 1.5B tokens, 3 recurrent loops
• Throughput: 29,000 tokens/sec (healthy for 3 loops)
• Expected completion: 9-12 hours
• Actual: >18 hours with no epoch completion

Performance Calculation:

Factor                                     Impact
4x data (1.5B vs 400M token baseline)      4x train time
4x batching time                           +4x overhead
1/4 steps                                  -4x time (cumulatively net 0 change)
2x grad accumulation, 0.5x batch size      net 0 (cumulatively net 0 change)
1.5x recurrence                            1.5x time (cumulatively net 1.5x change)
Linear attention (4x seq_len benefit)      ~4x time (cumulatively net 4.5x change)
Net expected                               ~4.5x the 2.5-hour baseline (~9-12 hours)
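
As a cross-check on the estimate, the figures quoted in this issue can be run through plain arithmetic. The sketch below assumes that the reported tok/s is a sustained average and that it counts the same tokens as the 1.5B figure (no padding or repeated passes); neither is confirmed here. Under those assumptions a single pass takes longer than the 9-12 hour estimate even with no GPU idle time, and the elapsed time multiplied by the rate exceeds the nominal dataset size, so the token count or the step estimate may deserve a look alongside preprocessing.

```python
def hours_for(tokens, tokens_per_sec):
    """Wall-clock hours needed to process `tokens` at a constant rate."""
    return tokens / tokens_per_sec / 3600


def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string, as printed in the progress line, to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# Figures quoted in this issue.
total_tokens = 1.5e9
epoch_hours = hours_for(total_tokens, 29_000)   # roughly 14.4 hours

elapsed = hms_to_seconds("18:13:53")            # 65633 seconds
batches_per_sec = 256_527 / elapsed             # roughly 3.91, matching the bar
tokens_seen = 29_007 * elapsed                  # roughly 1.9e9 if the rate is an average

print(f"one pass at 29,000 tok/s: {epoch_hours:.1f} h")
print(f"batches/s from the bar:   {batches_per_sec:.2f}")
print(f"tokens implied so far:    {tokens_seen:.3g}")
```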

Suspected Cause: CPU-bound batched preprocessing (tokenization) leaves the GPU
idle between batches. The previous 400M-token run with 2 loops completed in
~2.5 hours using a List[str] input instead of an iterable dataset.
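
One way to test the suspected cause is to measure how much of the loop's wall-clock time is spent waiting for the next batch. The sketch below is a minimal, framework-agnostic timer; `WaitTimer` and `slow_batches` are hypothetical names, not part of this project, and the sleep stands in for tokenization cost. If the training step dispatches work to the GPU asynchronously, the split is only approximate.

```python
import time


class WaitTimer:
    """Wraps an iterator and accumulates the time spent inside next()."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.wait_seconds = 0.0
        self.batches = 0

    def __iter__(self):
        return self

    def __next__(self):
        start = time.perf_counter()
        try:
            item = next(self._it)
        finally:
            # Counted even when the stream ends and StopIteration is raised.
            self.wait_seconds += time.perf_counter() - start
        self.batches += 1
        return item


def slow_batches(n, delay):
    """Hypothetical stream whose preprocessing takes `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield [i]


timer = WaitTimer(slow_batches(3, 0.01))
loop_start = time.perf_counter()
for batch in timer:
    pass  # the real training step would run here
loop_seconds = time.perf_counter() - loop_start

wait_fraction = timer.wait_seconds / loop_seconds
print(f"{timer.batches} batches, {wait_fraction:.0%} of loop time spent waiting on data")
```

A wait fraction near zero would suggest the data pipeline keeps up; a large fraction would support the CPU-bottleneck hypothesis.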

Questions:

1. Is the bottleneck diagnosis correct, or is there a math error in the estimate?
2. If the bottleneck is real, what options satisfy both constraints:
• Streaming (don't load everything into memory)
• No GPU idle time while waiting for the next batch
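
One option that keeps streaming while hiding batching latency is to prepare batches in a background thread behind a bounded queue. The sketch below is illustrative only: `prefetch` and `tokenize_batches` are hypothetical helpers, and `str.split` is a placeholder for the real tokenizer. Whether a thread is enough depends on whether the tokenizer releases the GIL; if it does not, a process pool would be the analogous approach.

```python
import queue
import threading


def prefetch(iterable, buffer_size=8):
    """Yield items from `iterable`, producing them ahead of time in a thread.

    At most `buffer_size` items are held in memory, so the source is still
    streamed. If the consumer stops early, the daemon worker stays blocked
    on the full queue until the process exits.
    """
    q = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for item in iterable:
                q.put(("item", item))
        except Exception as exc:  # forward producer errors to the consumer
            q.put(("error", exc))
        else:
            q.put(("done", None))

    threading.Thread(target=worker, daemon=True).start()
    while True:
        kind, value = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value


def tokenize_batches(texts, batch_size):
    """Hypothetical preprocessing: whitespace 'tokenization' plus batching."""
    batch = []
    for text in texts:
        batch.append(text.split())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


texts = ["a b", "c d e", "f"]
batches = list(prefetch(tokenize_batches(texts, batch_size=2), buffer_size=1))
```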

Potential Solutions:

• Pre-tokenize and cache to disk (count steps upfront)
• Stream on-the-fly with preprocessing bottleneck fixed
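
The first option can be sketched as a one-time pass that tokenizes the stream, packs the ids into fixed-length rows on disk, and returns the row count so the number of steps is known before training starts. Everything here is an assumption for illustration: `toy_tokenize` (UTF-8 bytes as ids) stands in for the real tokenizer, the flat binary file layout is arbitrary, and the trailing partial row is simply dropped.

```python
import array
import math
import os
import tempfile


def toy_tokenize(text):
    """Placeholder tokenizer: UTF-8 byte values as token ids."""
    return list(text.encode("utf-8"))


def build_cache(texts, path, seq_len, tokenize=toy_tokenize):
    """Tokenize once, write rows of `seq_len` ids, and return the row count."""
    buf = []
    n_rows = 0
    with open(path, "wb") as f:
        for text in texts:  # `texts` can be any iterable, so input stays streamed
            buf.extend(tokenize(text))
            while len(buf) >= seq_len:
                # "I" is an unsigned int, 4 bytes on common platforms.
                array.array("I", buf[:seq_len]).tofile(f)
                del buf[:seq_len]
                n_rows += 1
    return n_rows  # any trailing partial row is dropped


def stream_cache(path, seq_len):
    """Read the cache back one row at a time without loading the whole file."""
    row_bytes = seq_len * array.array("I").itemsize
    with open(path, "rb") as f:
        while True:
            chunk = f.read(row_bytes)
            if len(chunk) < row_bytes:
                return
            row = array.array("I")
            row.frombytes(chunk)
            yield row.tolist()


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "tokens.bin")
    n_rows = build_cache(["abcd", "efg", "hij"], cache_path, seq_len=4)
    rows = list(stream_cache(cache_path, seq_len=4))

batch_size = 2
steps_per_epoch = math.ceil(n_rows / batch_size)  # known before training starts
```

The cost is one full preprocessing pass and extra disk space; in exchange, later epochs skip tokenization entirely and the exact step count is available for progress reporting.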

Issue 2: Progress Bar Regression

Current Behavior (Streaming):

Epoch 1: 256527 batch [18:13:53, 3.91batch/s, loss=5.1309, ppl=169.16,
lr=1.00e-03, tok/s=29,007]

• Missing percent completion
• Batch number shown without total batches
• An estimated batch count exists, but TQDM is disabled

Expected Behavior (List[str]):

Epoch 25: 100%|██████████| 101/101 [01:54<00:00, 1.13s/batch, loss=2.4469,
ppl=11.55, lr=2.63e-04, tok/s=679]

• Shows percent completion
• Shows current/total batches
• Full progress bar

Root Cause: Streaming datasets were set up to use estimated steps, and TQDM
was disabled. The corrected logic to count batches first exists but may not be
properly integrated.
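
A minimal sketch of the count-first approach, assuming the stream can be re-created cheaply, each raw sample maps to exactly one training example, and the batch size is fixed. `count_batches` and `make_stream` are hypothetical names, not the project's API. The resulting number is the kind of value that tqdm accepts through its `total` argument, which is what enables the percentage and current/total display.

```python
def count_batches(make_stream, batch_size, drop_last=False):
    """Count batches with one cheap pass over a freshly created stream.

    `make_stream` is a zero-argument callable returning a new iterator, so
    the counting pass does not consume the stream used for training.
    """
    n_samples = sum(1 for _ in make_stream())
    if drop_last:
        return n_samples // batch_size
    return -(-n_samples // batch_size)  # ceiling division


def make_stream():
    """Hypothetical stand-in for re-opening the streaming dataset."""
    return (f"sample {i}" for i in range(1001))


total = count_batches(make_stream, batch_size=10)
total_dropping_last = count_batches(make_stream, batch_size=10, drop_last=True)
```

If tokenization packs or splits samples, the count has to be taken after that step instead, for example from a pre-tokenized cache.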
