I have a large audio dataset with about 1000 speakers and dozens of utterances per speaker. For the model I’m training I need to sample a largish batch (64) of speakers and then randomly sample 10 utterances per speaker.
I’ve created a Dataset which indexes over the speakers, with the __getitem__ method lazily returning the 10 utterances. This works well, but it is slow between epochs when the DataLoader needs to spawn new worker processes, etc. Since there are only 16 batches per epoch, this quickly adds up to a large amount of wasted time.
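For reference, a minimal sketch of the setup described above (class and file names are hypothetical; in practice this would subclass torch.utils.data.Dataset, which only requires __len__ and __getitem__):

```python
import random

class SpeakerDataset:
    """Indexed by speaker; each item is a random sample of 10 utterances.

    Hypothetical sketch of the dataset described above. In real code this
    would subclass torch.utils.data.Dataset and load audio in __getitem__.
    """

    def __init__(self, speaker_files, utterances_per_speaker=10):
        # Only file paths are stored here; the (slow) audio IO is
        # deferred to __getitem__, so __init__ stays cheap.
        self.speaker_files = dict(speaker_files)
        self.speakers = list(self.speaker_files)
        self.k = utterances_per_speaker

    def __len__(self):
        return len(self.speakers)

    def __getitem__(self, idx):
        speaker = self.speakers[idx]
        # Randomly pick 10 utterances for this speaker; real code would
        # read and decode the audio files here (the expensive IO step).
        return random.sample(self.speaker_files[speaker], self.k)
```

Sampling a batch of 64 speakers then reduces to drawing 64 indices from `range(len(dataset))`, e.g. with a standard sampler.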
Any recommendations would be appreciated. Thanks!
Do you load anything in your __init__?
The spawning of the multiple workers shouldn’t create a bottleneck if you lazily load the data in __getitem__.
I just load a list of files in the __init__, so I don’t think it is process spawning that’s causing the bottleneck. The call to __getitem__ has to do a lot of IO, so when the DataLoader starts and none of the workers have prefetched any data yet, it takes a while to get going. In the middle of an epoch this isn’t a problem, because some of the workers have already fetched data.
I am also suffering from this issue. When I set num_workers to 0, the issue goes away, but with larger num_workers the time between epochs grows, especially when the batch size is larger, e.g. 256.
Does the DataLoader respawn its worker processes every time it runs out of data, i.e. when an epoch ends?
The spawning overhead takes more time than the training itself. I think it should be optimized.
Do you know what the cause is? I am also facing the same problem on Linux. When num_workers is 0, it goes right to the next epoch, but when num_workers > 0 it takes a few seconds to reach the next epoch. The larger the num_workers value, the slower the between-epoch transition.
Have a look at this topic for a potential explanation of this effect and a workaround.
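One commonly used fix for this respawn overhead, in PyTorch 1.7 and later, is the DataLoader’s persistent_workers flag, which keeps the worker processes alive across epochs instead of tearing them down and respawning them each time the iterator is exhausted. A minimal sketch (the dataset and sizes here are placeholders, not the real audio data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real audio dataset (shapes are placeholders).
dataset = TensorDataset(torch.randn(640, 16), torch.randint(0, 1000, (640,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,
    # Keep worker processes alive between epochs instead of respawning
    # them every time the iterator is exhausted (requires PyTorch >= 1.7).
    persistent_workers=True,
)
```

With persistent_workers=True the workers (and their prefetch pipelines) survive the epoch boundary, so the per-epoch startup cost is paid only once.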