Dataloader returns data whenever any data is ready

amsword · May 26, 2021, 4:02am

The order of the data is determined by the sampler. Is there a way to have the order determined by when the data is ready instead? That is, each worker processes its data independently, and the data loader returns whichever data is ready rather than waiting for the specific item the sampler indexed next.

The reason: in my application, different samples need different amounts of computation time. Training becomes 2 to 4 times slower because it is held up by the samples that take longer.
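To make the requested behavior concrete, here is a minimal sketch of "yield whatever finishes first", written with the standard library rather than with DataLoader itself. The names load_sample and iter_as_ready are hypothetical, and the sleep stands in for uneven per-sample work; it only illustrates completion-order iteration and is not how DataLoader works internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_sample(index, cost):
    """Hypothetical loader: sleeps `cost` seconds to mimic uneven work."""
    time.sleep(cost)
    return index


def iter_as_ready(costs, num_workers=4):
    """Yield sample indices in completion order, not submission order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(load_sample, i, c) for i, c in enumerate(costs)]
        # as_completed hands back futures as they finish, so a slow
        # sample no longer blocks the ones submitted after it.
        for future in as_completed(futures):
            yield future.result()


if __name__ == "__main__":
    # Sample 0 is the slow one, so it is yielded last
    # even though it was requested first.
    print(list(iter_as_ready([0.5, 0.01, 0.01, 0.01])))
```

Note that the resulting order depends on load time, so slow samples systematically arrive later than the sampler intended.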

eqy · May 26, 2021, 5:56am

Can you work around this by prefetching more aggressively (e.g., by increasing prefetch_factor when creating your dataloader)?

amsword · May 26, 2021, 5:55pm

Thanks for your reply. I had tried increasing it before, but ran into insufficient shared memory errors.
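A rough back-of-the-envelope sketch of why a larger prefetch_factor can exhaust shared memory: each worker prefetches prefetch_factor batches, so about num_workers * prefetch_factor batches are in flight at once. The helper name prefetched_bytes, the batch size, and the image shape below are assumptions for illustration, not values from this thread.

```python
def prefetched_bytes(num_workers, prefetch_factor, batch_size,
                     sample_shape, bytes_per_element=4):
    """Estimate the bytes held by prefetched batches across all workers.

    Assumes every sample is a dense array of `sample_shape` with
    `bytes_per_element` bytes per value (4 for float32).
    """
    elements = 1
    for dim in sample_shape:
        elements *= dim
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * elements * bytes_per_element


if __name__ == "__main__":
    # Hypothetical setup: 8 workers, batches of 64 float32 images (3x224x224).
    for factor in (2, 8):
        size = prefetched_bytes(8, factor, 64, (3, 224, 224))
        print(f"prefetch_factor={factor}: {size / 2**30:.2f} GiB")
```

The estimate grows linearly with prefetch_factor, so larger raw images or bigger batches hit the shared memory limit sooner.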

eqy · May 26, 2021, 5:59pm

In many cases I find it can be easier to improve the data loading process rather than to change dataloader behavior itself; changing the order of samples based on loading time might also introduce some unwanted bias in the training. Can you share some more details about the data loading computation so that potential bottlenecks can be mitigated?
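One way to gather those details is to time how long each sample takes to load and look at the slowest ones. This is a minimal sketch, assuming the dataset supports integer indexing; time_samples and ToyDataset are hypothetical names, and the sleep only simulates one expensive sample.

```python
import time


def time_samples(dataset, indices):
    """Return (index, seconds) pairs for the given indices, slowest first."""
    timings = []
    for i in indices:
        start = time.perf_counter()
        dataset[i]  # load the sample and discard it; only the time matters
        timings.append((i, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)


class ToyDataset:
    """Hypothetical dataset in which sample 2 is much slower to load."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        time.sleep(0.2 if index == 2 else 0.001)
        return index


if __name__ == "__main__":
    dataset = ToyDataset()
    for index, seconds in time_samples(dataset, range(len(dataset))):
        print(f"sample {index}: {seconds:.3f} s")
```

Running this in the main process (without workers) on a subset of indices is usually enough to see whether a few outliers dominate the loading time.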

amsword · June 8, 2021, 4:53pm

Thanks for the reply. In the end, I found that some images were too large, and resizing them and resaving them at lower quality solved the problem. Disk I/O may also be part of the issue and may need to be improved.