Load multiple image files in DataLoader to improve storage I/O throughput

mahb324 · October 15, 2021, 12:30pm

I am trying to run PyTorch on a 100 GB dataset in which each image is about 1 MB.
I wrote a map-style dataset to read the images from storage. I have a powerful system (96 cores),
so I launched a DataLoader with 90 workers and a batch size of 1024,
but the DataLoader is still slow, and the I/O traffic does not go beyond 250 MB/s.
It seems that each worker currently gets only one index at a time and loads a single data point.
I tried passing a sampler to fetch multiple data points per call, but it did not help; the performance stayed the same.

Can someone help me make each worker fetch more files at once and improve the performance of the DataLoader?

ptrblck · October 15, 2021, 11:55pm

Depending on your SSD and its connection, 250 MB/s might be expected.
Did you profile your SSD, and if so, what's the max. read speed?
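
As a rough cross-check next to a dedicated benchmarking tool, the raw read speed of the image files can also be measured from Python without any DataLoader involved. The following is only a minimal sketch: `read_file` and `measure_read_throughput` are hypothetical helper names, the thread count is an arbitrary assumption, and it reads the files without decoding them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file fully and return the number of bytes read."""
    with open(path, "rb") as f:
        return len(f.read())


def measure_read_throughput(paths, num_threads=16):
    """Read all files concurrently; return (total_bytes, seconds, MB_per_s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes, elapsed, total_bytes / elapsed / 1e6
```

Files read recently may be served from the OS page cache, so a repeated run can report a much higher number than the SSD itself delivers. If this raw number is far above what the DataLoader reaches, the bottleneck is more likely in decoding or preprocessing than in storage.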

mahb324 · October 18, 2021, 8:42pm

I ran FIO on my SSD and measured a throughput of around 2 GB/s,
but through the DataLoader the throughput drops considerably.
How can I load more images per worker?
Should I use a map-style or an iterable-style dataset to increase the read load per worker?

Thanks

ptrblck · October 18, 2021, 9:19pm

The difference between map-style and iterable-style datasets is that the former uses indices to load samples, while the latter can be used for streaming data.
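
To make the distinction concrete, here is a small sketch of both styles. The class names `SquaresMap` and `SquaresStream` are made up for illustration, and the samples are toy tensors rather than images.

```python
import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset


class SquaresMap(Dataset):
    """Map-style: the DataLoader requests sample `idx` via __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.tensor(idx * idx)


class SquaresStream(IterableDataset):
    """Iterable-style: the dataset itself decides the order and yields samples."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield torch.tensor(i * i)


# Both are consumed the same way; only the access pattern differs.
map_batches = list(DataLoader(SquaresMap(6), batch_size=3))
stream_batches = list(DataLoader(SquaresStream(6), batch_size=3))
```

Note that with `num_workers > 0` each worker receives its own copy of an iterable-style dataset, so the stream has to be split per worker (for example with `torch.utils.data.get_worker_info()`) to avoid duplicated samples.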
You could search for third-party Python libraries that provide high throughput and check whether you could use them. The PyTorch Dataset and DataLoader are utilities to load, shuffle, and batch samples. The data loading itself can be done by any library that is compatible with multiprocessing.
If you think that loading multiple samples in a single __getitem__ call might be beneficial, you could check the BatchSampler, which passes a batch of indices to this method.
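
A minimal sketch of the BatchSampler approach could look like this. `BatchedImageDataset` and `_load_one` are hypothetical names, and the placeholder tensor stands in for the real file read and image decoding, which are not shown in this thread.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, RandomSampler


class BatchedImageDataset(Dataset):
    """Hypothetical dataset whose __getitem__ accepts a list of indices."""

    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def _load_one(self, idx):
        # Placeholder (assumption): replace with real file I/O and decoding.
        return torch.full((3, 4, 4), float(idx))

    def __getitem__(self, indices):
        # `indices` is a list of ints produced by the BatchSampler.
        return torch.stack([self._load_one(i) for i in indices])


dataset = BatchedImageDataset(10)
sampler = BatchSampler(RandomSampler(dataset), batch_size=4, drop_last=False)
# batch_size=None disables automatic batching, so each list of indices
# from the sampler is passed to __getitem__ in a single call.
loader = DataLoader(dataset, sampler=sampler, batch_size=None)
shapes = [tuple(batch.shape) for batch in loader]
```

Passing the BatchSampler through `sampler=` with `batch_size=None` is what makes `__getitem__` receive the whole list; passing it through `batch_sampler=` would instead call `__getitem__` once per index. Whether this improves throughput depends on what `__getitem__` does with the list, for example issuing the reads for one batch concurrently.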