Dataloader slow between epochs

I have a large audio dataset with about 1000 speakers and dozens of utterances per speaker. For the model I’m training I need to sample a largish batch (64) of speakers and then randomly sample 10 utterances per speaker.

I’ve created a Dataset which indexes over the speakers with the __getitem__ method lazily returning the 10 utterances. This works well but is slow between epochs when the dataloader needs to spawn new processes, etc. Since there are only 16 batches per epoch this quickly becomes a large amount of wasted time.
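For reference, here is a minimal sketch of the kind of Dataset I mean (the class name, the `speaker_to_files` mapping, and the use of `torchaudio` are just illustrative assumptions, not my exact code):

```python
import random
import torchaudio
from torch.utils.data import Dataset

class SpeakerDataset(Dataset):
    """One item = one speaker, returning a handful of randomly chosen utterances."""

    def __init__(self, speaker_to_files, utterances_per_speaker=10):
        # __init__ only stores the file lists; no audio is loaded here,
        # so constructing the dataset (and forking workers) stays cheap.
        self.speakers = list(speaker_to_files.keys())
        self.speaker_to_files = speaker_to_files
        self.utterances_per_speaker = utterances_per_speaker

    def __len__(self):
        return len(self.speakers)

    def __getitem__(self, index):
        files = self.speaker_to_files[self.speakers[index]]
        # Each speaker has dozens of utterances, so sampling 10 is safe.
        chosen = random.sample(files, self.utterances_per_speaker)
        # The heavy IO happens lazily, per item, inside the worker processes.
        # Padding/cropping to a common length is omitted for brevity.
        return [torchaudio.load(path)[0] for path in chosen]
```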

Any recommendations would be appreciated. Thanks!

Do you load anything in your Dataset's __init__ method?
The spawning of the multiple workers shouldn't create a bottleneck if you lazily load the data in __getitem__.

I just load a list of files in the __init__, so I don't think it's process spawning that's causing the bottleneck. The call to __getitem__ has to do a lot of IO, so when the dataloader starts and none of the workers have prefetched any data yet, it takes a while to get going. In the middle of an epoch this isn't a problem because some of the workers have already fetched data.
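To make that startup stall concrete, this is roughly how I'd measure it (a sketch, assuming a `dataset` like the one above; the exact numbers will obviously depend on the storage):

```python
import time
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=64, num_workers=4)

for epoch in range(3):
    start = time.time()
    for i, batch in enumerate(loader):
        if i == 0:
            # The first batch of each epoch arrives only after the freshly
            # started workers have each finished their first __getitem__ calls.
            print(f"epoch {epoch}: first batch after {time.time() - start:.2f}s")
```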

I am also suffering from this issue. When I set num_workers to 0, the issue goes away, but with larger num_workers the time between epochs grows, especially when the batch size is large, e.g. 256.
Does the dataloader respawn its worker processes every time it runs out of data, i.e. when an epoch ends?
The spawning overhead takes more time than the training itself. I think it should be optimized.

Do you know what the cause is? I am also facing the same problem on Linux. When num_workers is 0, training goes straight into the next epoch, but when num_workers > 0 it takes a few seconds to move on. The larger the num_workers value, the slower the transition between epochs.

Have a look at this topic for a potential explanation of this effect and a workaround. :slight_smile:
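In case it helps, one workaround along those lines (and, I believe, what the linked topic suggests) is to keep the worker processes alive across epochs instead of respawning them. A sketch, assuming a reasonably recent PyTorch version that supports these DataLoader arguments:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # e.g. the SpeakerDataset sketched above
    batch_size=64,
    num_workers=4,
    persistent_workers=True,  # workers survive between epochs, no respawn cost
    prefetch_factor=2,        # batches each worker keeps queued ahead of time
)
```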
