Hi everyone, I am trying to understand the behavior of torch.nn.DataParallel. The example code is given below for reference. Let's say I use a batch size of 8 and two GPUs, so each GPU processes 4 data samples. My questions are:
While updating the running means for batch normalization, does this module update the mean in the original model using the whole batch (all 8 samples), or only the samples on a specific device? In other words, if GPU:0 estimates a mean 'a' for its batch of 4 data samples and GPU:1 estimates another mean 'b', does PyTorch update the model's batch-normalization statistics by combining 'a' and 'b', or do the two devices update the running mean independently of each other? I looked at the documentation and the source code but did not understand.
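To make the question concrete, here is a minimal CPU-only sketch (no DataParallel involved, and the 4/4 split and numbers are made up) showing how a per-replica mean can differ from the whole-batch mean, and how a single-device BatchNorm layer folds the batch mean into its running mean:

```python
import torch
import torch.nn as nn

# Toy batch of 8 samples, 1 feature, conceptually split 4/4 across two GPUs.
x = torch.tensor([[1.], [2.], [3.], [4.], [10.], [20.], [30.], [40.]])
half_a, half_b = x[:4], x[4:]

mean_a = half_a.mean()   # per-replica mean on "GPU:0" -> 2.5
mean_b = half_b.mean()   # per-replica mean on "GPU:1" -> 25.0
mean_full = x.mean()     # whole-batch mean -> 13.75

# With the default momentum=0.1, BatchNorm updates
#   running_mean = 0.9 * running_mean + 0.1 * batch_mean
bn = nn.BatchNorm1d(1, momentum=0.1)
bn.train()
bn(x)  # single-device forward: the update uses the whole-batch mean
print(bn.running_mean.item())  # ~= 0.1 * 13.75 = 1.375
```

The gap between mean_a, mean_b, and mean_full is exactly what the question is about: whether each replica applies the update with its own 4-sample mean or with the combined 8-sample one.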
I saw a comment where Soumith mentioned: 'if you notice the examples, DataParallel is not applied to the entire network + loss. It is only applied to part of the network.' (How to use DataParallel in backward?). How does PyTorch's nn.DataParallel decide which part of the network to send to the GPUs?
The documentation says: 'The parallelized module must have its parameters and buffers on device_ids[0] before running this DataParallel module.' I was moving the model 'net' to the GPU after applying nn.DataParallel to 'net' (as shown below), and the model trains fine. I do not understand why it is compulsory to send the model to the GPU before applying nn.DataParallel (as stated in the source code)?
```python
net = nn.DataParallel(net)
net = net.to(device)
if torch.cuda.is_available():
    net.cuda()
    softMax.cuda()
    CE_loss.cuda()
```
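For comparison, here is a sketch of the order the documentation describes, using a stand-in nn.Linear for 'net' (and a CPU fallback so it runs anywhere):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = nn.Linear(16, 10)      # stand-in for the post's `net`
net = net.to(device)         # documented order: move params/buffers first...
net = nn.DataParallel(net)   # ...then wrap

x = torch.randn(8, 16, device=device)
print(net(x).shape)  # torch.Size([8, 10])
```

If I understand the source correctly, the reverse order in the snippet above also works because the requirement is only checked at forward time: calling .to(device) or .cuda() on the wrapper recurses into the wrapped module, so the parameters are already on device_ids[0] by the time forward runs.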