accelerate-streaming-dataset-loading



  Title: Streaming Dataset Performance Bottleneck and Progress Bar Issues


  Description

  The API enhancements for handling streaming datasets implemented in #34 have
  two critical issues affecting training time and monitoring accuracy.


  Issue 1: CPU Bottleneck in Preprocessing


  Expected vs. Actual Performance:

   • Config: 512 seq len, d_model 512, 1.5B tokens, 3 recurrent loops
   • Throughput: 29,000 tokens/sec (healthy for 3 loops)
   • Expected completion: 9-12 hours
   • Actual: >18 hours with no epoch completion


  Performance Calculation:

                                                                 
   Factor                                  Impact         Cumulative net
   4x data (1.5B vs. 400M token baseline)  4x train time
   4x batching time                        +4x overhead
   1/4 steps                               -4x time       0 change
   2x grad accumulation, 0.5x batch size   net 0          0 change
   1.5x recurrence                         1.5x time      1.5x
   Linear attention (4x seq_len benefit)   ~4x time       4.5x
   Net expected                            ~4.5x the 2.5-hour baseline, ~9-12 hours
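As a cross-check on the estimate, the sketch below compares two figures: the baseline scaled by the net factor from the table, and the time implied by the reported throughput. It assumes that 1.5B tokens is a single pass over the data and that 29,000 tok/s is sustained for the whole epoch. The function and constant names are illustrative only.

```python
def hours_for_tokens(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours needed to process total_tokens at a steady throughput."""
    return total_tokens / tokens_per_sec / 3600.0


# Figures quoted in this issue (assumed to describe one pass over the data).
TOTAL_TOKENS = 1.5e9     # dataset size in tokens
THROUGHPUT = 29_000      # tokens/sec reported by the progress bar
BASELINE_HOURS = 2.5     # duration of the 400M-token run
NET_FACTOR = 4.5         # net multiplier from the table

scaled_estimate = NET_FACTOR * BASELINE_HOURS
throughput_estimate = hours_for_tokens(TOTAL_TOKENS, THROUGHPUT)

print(f"scaled from baseline: {scaled_estimate:.2f} h")
print(f"from measured tok/s:  {throughput_estimate:.2f} h")
```

Under those assumptions the scaled estimate is 11.25 hours, while 1.5B tokens at 29,000 tok/s works out to about 14.4 hours. If the assumptions hold, the reported throughput alone would put one epoch outside the 9-12 hour window, separately from any preprocessing stall.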



  Suspected Cause: CPU-bound batched preprocessing (tokenization) is leaving
  the GPU idle. The previous 400M-token run with 2 loops completed in ~2.5 hours
  using an in-memory List[str] instead of an iterable dataset.


  Questions:

   1 Is the bottleneck diagnosis correct, or is there a math error?
   2 If the bottleneck is real, what are the options that satisfy both:
      • Streaming (don't load everything into memory)
      • No GPU stalls while waiting for batches to be prepared
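One way to meet both constraints is to tokenize in a background thread that fills a bounded queue, so memory stays bounded while batches are prepared ahead of the consumer. The sketch below is a minimal illustration using only the standard library. `prefetch_tokenized`, `batched`, and the `str.split` stand-in tokenizer are hypothetical names, not part of the code from #34. A thread only overlaps work if the tokenizer releases the GIL; a pure-Python tokenizer would need worker processes instead.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_SENTINEL = object()  # marks the end of the stream


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream of strings into lists of at most batch_size items."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def prefetch_tokenized(
    texts: Iterable[str],
    tokenize: Callable[[str], list],
    batch_size: int,
    max_prefetch: int = 8,
) -> Iterator[List[list]]:
    """Yield tokenized batches that a background thread prepares ahead of time.

    The queue holds at most max_prefetch batches, so memory use stays bounded.
    """
    q: queue.Queue = queue.Queue(maxsize=max_prefetch)

    def worker() -> None:
        try:
            for batch in batched(texts, batch_size):
                q.put([tokenize(t) for t in batch])  # blocks when the queue is full
            q.put(_SENTINEL)
        except Exception as exc:  # hand errors to the consumer
            q.put(exc)

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        if isinstance(item, BaseException):
            raise item
        yield item


# Stand-in tokenizer for illustration; a real run would pass the model's tokenizer.
demo = list(prefetch_tokenized(iter(["a b", "c d e", "f"]), str.split, batch_size=2))
print(demo)
```

The bounded queue is the key design choice: the producer runs at most `max_prefetch` batches ahead, which keeps the streaming property while hiding tokenization latency behind the training step.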


  Potential Solutions:

   • Pre-tokenize and cache to disk (count steps upfront)
   • Stream on-the-fly with preprocessing bottleneck fixed
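The first option could look roughly like the sketch below: one streaming pass tokenizes each example, appends it to an on-disk cache, and counts examples so the number of batches is known before training starts. This is an assumption-laden illustration, not the project's implementation. The JSON-lines cache format, the function names, and `toy_tokenize` are all hypothetical.

```python
import json
import math
import os
import tempfile
from typing import Callable, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: maps each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


def build_token_cache(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    cache_path: str,
) -> int:
    """Tokenize a stream once, write one JSON line per example, return the count."""
    count = 0
    with open(cache_path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps(tokenize(text)) + "\n")
            count += 1
    return count


def iter_cached_batches(cache_path: str, batch_size: int) -> Iterator[List[List[int]]]:
    """Read the cache lazily, so only one batch is held in memory at a time."""
    batch: List[List[int]] = []
    with open(cache_path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def num_batches(num_examples: int, batch_size: int) -> int:
    """Exact batch count per epoch, including a final partial batch."""
    return math.ceil(num_examples / batch_size)


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "tokens.jsonl")
    n_examples = build_token_cache(iter(["a bb", "ccc", "d e f"]), toy_tokenize, cache)
    total_batches = num_batches(n_examples, batch_size=2)
    cached = list(iter_cached_batches(cache, batch_size=2))

print(n_examples, total_batches, cached)
```

Tokenization cost is paid once rather than every epoch, and the exact count returned by `build_token_cache` can feed the progress bar total. A binary format would be more compact for a 1.5B-token corpus; JSON lines is used here only to keep the sketch short.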


  Issue 2: Progress Bar Regression


  Current Behavior (Streaming):

                                                                                
   Epoch 1: 256527 batch [18:13:53, 3.91batch/s, loss=5.1309, ppl=169.16, lr=1.00e-03, tok/s=29,007]
                                                                                

   • Percent completion is missing
   • Batch number is shown without the total number of batches
   • An estimated batch count exists, but TQDM is disabled


  Expected Behavior (List[str]):

                                                                                
   Epoch 25: 100%|██████████| 101/101 [01:54<00:00, 1.13s/batch, loss=2.4469, ppl=11.55, lr=2.63e-04, tok/s=679]
                                                                                

   • Shows percent completion
   • Shows current/total batches
   • Full progress bar


  Root Cause: Streaming datasets were set up to use estimated steps, and TQDM
  was disabled. The corrected logic to count batches first exists but may not be
  properly integrated.
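For reference, tqdm only renders the percentage, bar, and ETA when it is given a `total`. With `total=None` it falls back to a plain counter and rate, which matches the streaming output above. A minimal sketch of wiring a known batch count into the bar follows. `count_batches` and `make_progress_bar` are hypothetical helper names, and the trainer's real integration point may differ.

```python
import math
from typing import Iterable, Optional


def count_batches(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches per epoch for a known example count."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def make_progress_bar(batches: Iterable, epoch: int, total: Optional[int] = None):
    """Wrap a batch iterator in tqdm.

    With total set, tqdm shows percent, bar, n/total and ETA.
    With total=None, it shows only the running count and rate.
    """
    from tqdm import tqdm  # imported lazily so count_batches has no third-party dependency

    return tqdm(batches, total=total, desc=f"Epoch {epoch}", unit="batch")


# Example: 10 examples in batches of 4 gives 3 batches (the last one partial).
print(count_batches(10, 4))
```

In a training loop this would be used as `bar = make_progress_bar(batch_iter, epoch=1, total=count_batches(n, bs))`, with metrics attached through `bar.set_postfix(loss=...)`.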



accelerate-streaming-dataset-loading #35
