Unable to use 'convert_dataset.py' to load data

I am getting a server disconnected error when I use 'convert_dataset.py', even for the bookcorpus or wikipedia datasets.
If I set `stream=False` in the code, I get the following error:

`Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [03:06<00:00,  4.55s/files]
Generating train split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6458670/6458670 [01:06<00:00, 96995.20 examples/s]
Loading dataset shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:00<00:00, 1805.01it/s]
Traceback (most recent call last):
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 524, in <module>
    main(parse_args())
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 489, in main
    loader = build_dataloader(dataset=dataset, batch_size=512)
  File "/home/sandeep.pandey/m2/bert/src/convert_dataset.py", line 397, in build_dataloader
    num_workers = min(64, dataset.hf_dataset.n_shards)  # type: ignore
AttributeError: 'Dataset' object has no attribute 'n_shards'`
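
The traceback points at `dataset.hf_dataset.n_shards`. With streaming turned off, the Hugging Face object is a map-style `Dataset`, which has no `n_shards` attribute, whereas the streaming `IterableDataset` does. A possible workaround is a guarded lookup, shown here as a minimal sketch. The helper name `pick_num_workers` and the fallback of 0 workers are assumptions for illustration, not part of the script; 0 is chosen only because it is unclear whether the script's wrapper splits a map-style dataset across workers.

```python
def pick_num_workers(hf_dataset, max_workers=64):
    """Pick a DataLoader worker count for streaming or map-style datasets.

    Hypothetical helper, not part of convert_dataset.py.
    """
    # Streaming datasets expose a shard count; map-style ones do not,
    # which is what raises the AttributeError in the traceback.
    n_shards = getattr(hf_dataset, "n_shards", None)
    if n_shards is not None:
        # Same rule as the original line: at most one worker per shard.
        return min(max_workers, n_shards)
    # Assumption: with no shard information, load in the main process so
    # that several workers cannot each iterate over the full dataset.
    return 0
```

Under these assumptions, line 397 of the script would become `num_workers = pick_num_workers(dataset.hf_dataset)`.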

Please help me resolve this, as I am stuck on reproducing the training pipeline.

Unable to use 'convert_dataset.py' to load data #36
