Slow training with periodic pauses during each epoch when the folder has a lot of images

Hey,

So I noticed something when I was training a U-Net on my own data. I have 7000 images in one folder and 7000 ground-truth tensors in another, and I am using num_workers = 8. Let's say I use a batch size of 4: I notice a pause in the epoch after 32 steps, another one after 64, and so on. These pauses make my training very slow.
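For context, my setup is roughly like the sketch below (the class name, folder paths, and the .png/.pt naming are placeholders, not my exact code):

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image

# Rough sketch of the setup; paths and file naming are placeholders.
class SegmentationDataset(Dataset):
    def __init__(self, image_dir, target_dir):
        self.image_dir = image_dir
        self.target_dir = target_dir
        self.filenames = sorted(os.listdir(image_dir))

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        name = self.filenames[idx]
        image = read_image(os.path.join(self.image_dir, name)).float() / 255.0
        # Ground-truth masks are stored as pre-saved tensors (.pt files).
        target = torch.load(os.path.join(self.target_dir, name.replace(".png", ".pt")))
        return image, target

dataset = SegmentationDataset("data/images", "data/targets")
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=8)
```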

Interestingly, if I make the dataset much smaller, say I keep only 100 images in the dataset folder, training gets a lot faster. For example, training with 100 images takes about 10 seconds per epoch. So you would think that with 7000 images an epoch would take somewhere around (10 x 7000)/100 ≈ 700 seconds, which is roughly 12 minutes. In reality it takes around 35 minutes to complete an epoch.
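This is roughly how I timed the steps to see where the pauses land (`model`, `criterion`, and `optimizer` stand in for my actual U-Net, loss, and optimizer); the "fetch" time spikes at those regular intervals:

```python
import time

# Rough per-step timing: measure how long we wait on the DataLoader
# versus how long the forward/backward pass itself takes.
step_start = time.time()
for step, (images, targets) in enumerate(loader):
    fetch_time = time.time() - step_start  # time spent waiting for the next batch
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    print(f"step {step}: fetch {fetch_time:.3f}s, total {time.time() - step_start:.3f}s")
    step_start = time.time()
```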

Any insight or solution for this would be helpful. Thanks!