Relation between num_workers, batch_size and epoch in DataLoader?

I have a question about the relation between num_workers in DataLoader, batch_size, and the number of epochs.

E.g.

  1. Let's assume the total training size is 2000, the batch_size is 20, and we use num_workers=10 in the DataLoader. Does this mean that the DataLoader will return 20 * 10 = 200 examples in parallel, consisting of 20 examples from each of the individual DataLoader workers?

Or

  2. The DataLoader returns 20 examples sequentially from each worker and finishes 10 epochs in parallel after 10 (number of workers) * 100 (iterations to complete one epoch) = 1000 iterations.

Or

  3. Something else? Kindly explain.

If the first one is correct, does the loss function aggregate across those 200 examples?
If the second one is correct, then the loss would be calculated for 20 examples from the 1st worker, then the 2nd worker, then the 3rd, and so on until the end. My question here is: if the loss is calculated for 20 examples from each worker before proceeding to the next batch, is there any concern that the model could converge to a completely different minimum than it would with a single worker? After all, the model would be taking different steps towards the minimum in mini-batch gradient descent with more workers than with a single worker.

If my understanding is wrong, I would be extremely thankful if someone could clarify.


num_workers is not related to batch_size. Say you set batch_size to 20 and the training size is 2000; then each epoch contains 100 iterations, i.e. for each iteration the DataLoader returns a batch of 20 instances. num_workers > 0 is used to preprocess batches of data in background worker processes so that the next batch is ready for use when the current batch has been consumed. Using more num_workers consumes more memory but helps speed up the I/O and data-loading process. Please refer to this thread for more discussion of this topic.
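To make that concrete, here is a minimal sketch using the numbers from the question (the toy dataset and its feature size are made up for illustration). It shows that the batch size seen in the training loop is independent of num_workers:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # guard needed when workers are spawned as subprocesses
    # Toy dataset matching the numbers in the question: 2000 samples, 10 features each.
    dataset = TensorDataset(torch.randn(2000, 10), torch.randint(0, 2, (2000,)))

    # num_workers only controls how many background processes *prepare* batches;
    # each batch the loop receives still contains exactly batch_size samples.
    loader = DataLoader(dataset, batch_size=20, num_workers=10, shuffle=True)

    print(len(loader))           # 100 iterations per epoch (2000 / 20)
    for inputs, targets in loader:
        print(inputs.shape)      # torch.Size([20, 10]) -- one batch of 20, not 20 * 10
        break
```

So the loss is always computed over a single batch of 20 examples, regardless of how many workers prepared the batches, and convergence behaviour is unaffected by num_workers.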
