Iterating batch in DataLoader takes more time after every nth iteration

My data consists of images and compressed (.npz) files that contain 3D points and a label for each point. Each file contains a minimum of 646464 points, and file sizes vary from 10 MB to 45 MB. I use num_workers=4 and batch_size=32, and points_sampled for every batch is 2048.

Total number of parameters: 13414465

train batch 1 load time 21.43259024620056s
train batch 2 load time 0.031423091888427734s
train batch 3 load time 0.004406452178955078
train batch 4 load time 0.004347562789916992
train batch 5 load time 18.13344931602478
train batch 6 load time 0.004399538040161133
train batch 7 load time 0.03353142738342285
train batch 8 load time 0.004467010498046875
train batch 9 load time 16.202253103256226
train batch 10 load time 0.8377358913421631

I understand that the input files are large and that I load them from a network location, which could cause the slowdown.
My questions are:

  1. Why is it slow only after every num_workers iterations?
  2. How can we make loading faster?

Thanks
Anil

  1. Each worker preloads a complete batch. If all workers start at the same time, they will likely finish close to each other, so in your case 4 batches become ready at roughly the same moment. Your actual model workload seems to be small compared to the data loading, so training on these 4 batches finishes quickly. Meanwhile, the workers have already started loading the next batches but cannot keep up with the model training, so you clearly have a data loading bottleneck in your code.

  2. Have a look at this post for a general explanation and some advice.
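
To make the pattern from point 1 concrete, here is a small simulation in plain Python (no PyTorch involved). It is only a sketch under simplifying assumptions: batch i is handled by worker i % num_workers, every batch takes the same `load_time`, every training step takes the same `step_time`, and the prefetch limit of the real DataLoader is ignored. The function name and the numbers (20 s per batch, 0.01 s per step) are made up for illustration and only loosely mirror the log above.

```python
def simulate_wait_times(num_workers, load_time, step_time, num_batches):
    """Return how long the training loop waits for each batch, in seconds."""
    ready_at = []  # time at which each batch has finished loading
    waits = []
    now = 0.0      # clock of the training loop
    for i in range(num_batches):
        # Assumption: batch i is loaded by worker i % num_workers, which starts
        # on it as soon as it has finished its previous batch (i - num_workers).
        started = ready_at[i - num_workers] if i >= num_workers else 0.0
        ready_at.append(started + load_time)
        # The loop blocks until the batch is ready, then runs one training step.
        waits.append(max(0.0, ready_at[i] - now))
        now = max(now, ready_at[i]) + step_time
    return waits


# Hypothetical numbers: 4 workers, 20 s to load a batch, 0.01 s per step.
for i, wait in enumerate(simulate_wait_times(4, 20.0, 0.01, 10), start=1):
    print(f"train batch {i} wait {wait:.2f}s")
```

Under these assumptions the loop stalls on batches 1, 5, 9, and so on, and gets the batches in between almost immediately, which is the same shape as the timings in the question.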

Thank you @ptrblck for the answer. Does this mean that using the maximum possible num_workers is the most efficient choice, assuming some data loading bottleneck remains?

I think you would get the best performance using a “sane” number of workers, which depends on your overall system. E.g. 4 workers might work well for a local workstation, but might not be enough for a bigger server.
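
As a rough back-of-the-envelope check, the average time per batch can be estimated for different worker counts. This is an idealized sketch, not a measurement: it assumes loading scales perfectly with the number of workers, which CPU cores, memory, and especially a shared network location usually prevent in practice. Both function names and all numbers are hypothetical.

```python
import math


def avg_seconds_per_batch(num_workers, load_time, step_time):
    # Idealized steady state: the workers together deliver one batch every
    # load_time / num_workers seconds, and the training loop cannot go
    # faster than one training step per batch.
    return max(step_time, load_time / num_workers)


def workers_to_hide_loading(load_time, step_time):
    # Smallest worker count for which loading no longer limits the loop,
    # again assuming perfect scaling (an optimistic assumption).
    return math.ceil(load_time / step_time)


# Hypothetical numbers: 20 s to load a batch, 0.01 s per training step.
for workers in (1, 4, 8, 16):
    print(workers, avg_seconds_per_batch(workers, load_time=20.0, step_time=0.01))
```

With a gap this large between load time and step time, the worker count needed to hide the loading completely would be far beyond what a machine can run, so adding workers helps only up to the limits of the system, and reducing the per-batch load time matters more.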

Experimented and understood, thank you!