Guidelines for assigning num_workers to DataLoader

I recall Jeremy Howard saying that larger batches are faster because the GPU has to do some setup before each batch (launching kernels, copying the batch's data onto the device, and possibly resetting other state?), and that per-batch setup takes time. For example, with 768 MNIST images, a batch size of 1 means the GPU pays that setup cost 768 times per epoch, whereas a batch size of 64 means it only pays it 12 times.
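
Here's a minimal sketch of what I mean (assuming a plain PyTorch `DataLoader` over a dummy MNIST-sized tensor dataset; the `num_workers=2` value is just a placeholder, not a recommendation) showing how the batch size sets the number of per-batch setups each epoch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy "MNIST-like" dataset: 768 images of 1x28x28 (matching the example above)
images = torch.randn(768, 1, 28, 28)
labels = torch.randint(0, 10, (768,))
dataset = TensorDataset(images, labels)

# batch_size=1  -> 768 batches, so the per-batch overhead (kernel launches,
#                  host-to-device copies, Python loop) is paid 768 times.
# batch_size=64 -> only 12 batches (768 / 64), so that overhead is paid 12 times.
for batch_size in (1, 64):
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=2, shuffle=True)
    print(f"batch_size={batch_size}: {len(loader)} batches per epoch")
```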