nn.DataParallel and batch size

nn.DataParallel splits the input batch along dimension 0 and runs the forward pass on each GPU in parallel.
If so, is the result of (1 GPU, batch_size=16, 4 iterations) the same as the result of (4 GPUs, batch_size=16*4, 1 iteration)?

Not at all. First of all, keep epochs in mind: an epoch is one full pass over the training set. If switching to the larger batch means you run fewer iterations over the same data, the network sees each example fewer times, and that alone will produce different results.
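Even on exactly the same 64 examples, the two schedules perform different updates: four sequential small-batch steps each start from weights already changed by the previous step, while one large-batch step averages all gradients at the original weights. A minimal sketch (plain NumPy, with a hypothetical 1-D linear-regression setup, nothing DataParallel-specific):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 64 samples from y = 3x + noise (1-D linear regression).
x = rng.normal(size=64)
y = 3.0 * x + 0.1 * rng.normal(size=64)

def sgd(batches, lr=0.1):
    """Plain SGD on a scalar weight w: one step per (xb, yb) batch, minimizing MSE."""
    w = 0.0
    for xb, yb in batches:
        grad = 2.0 * np.mean((w * xb - yb) * xb)  # d/dw of mean((w*x - y)^2)
        w -= lr * grad
    return w

# 4 steps with batch_size=16 vs 1 step with batch_size=64, same 64 examples.
w_small = sgd([(x[i:i + 16], y[i:i + 16]) for i in range(0, 64, 16)])
w_large = sgd([(x, y)])

print(w_small, w_large)  # the two runs end at different weights
```

The gap shrinks as the learning rate goes to zero, but at practical learning rates the trajectories genuinely differ.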

Even with the same number of epochs, the results still differ. The problem is that changing the batch size changes each SGD step: the batch size controls the variance of your gradient estimate, so many small-batch steps follow a noisier trajectory than a few large-batch steps. The general consensus in the community is that the batch size should be neither too large nor too small.
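You can see the variance effect directly with a small simulation (hypothetical i.i.d. per-example gradients, not tied to any particular model): the variance of the mini-batch mean gradient shrinks roughly as 1/batch_size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend each training example yields an i.i.d. scalar gradient with true mean 1.0.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=10_000)

def grad_estimate_variance(batch_size, n_batches=2_000):
    """Empirical variance of the mini-batch mean gradient over random batches."""
    means = [
        per_example_grads[rng.integers(0, 10_000, size=batch_size)].mean()
        for _ in range(n_batches)
    ]
    return float(np.var(means))

v16 = grad_estimate_variance(16)
v64 = grad_estimate_variance(64)
print(v16, v64)  # v64 is roughly 4x smaller: noise scales like 1/batch_size
```

That gradient noise is not purely harmful; it acts as a regularizer, which is part of why very large batches can hurt generalization.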

Bottom line: if you multiply your batch size by 200, expect to see worse results, unless you use some tricks. A good read on the subject is "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour".
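The main trick from that paper is the linear scaling rule: when you multiply the batch size by k, multiply the learning rate by k as well, and ramp it up gradually over the first few epochs (warmup). A minimal sketch of such a schedule; the base values here are illustrative, not a prescription:

```python
def scaled_lr(base_lr, base_batch, batch, epoch, warmup_epochs=5):
    """Linear scaling rule with gradual warmup.

    Scale the learning rate by batch / base_batch, ramping linearly from
    base_lr up to the scaled value over the first warmup_epochs epochs.
    """
    target = base_lr * batch / base_batch
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

# Going from batch 256 to batch 1024 quadruples the target learning rate.
for epoch in (0, 2, 5, 10):
    print(epoch, scaled_lr(0.1, 256, 1024, epoch))  # 0.1, 0.22, 0.4, 0.4
```

In PyTorch you would typically apply this by setting the optimizer's lr per epoch (e.g. via a learning-rate scheduler) rather than hand-rolling it like this.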
