Hi everyone, I am trying to understand the behavior of torch.nn.DataParallel. An example code portion is given below for reference. Let's say I am using a batch size of 8 and two GPUs, so each GPU processes 4 data samples. My questions are:
- While updating the running means for batch normalization, does this module update the running mean on the original model by considering the whole batch (all 8 samples), or does each device only update its own replica? In other words, if GPU:0 estimates a mean ‘a’ for its batch of 4 data samples and GPU:1 estimates another mean ‘b’, does PyTorch update the batch-normalization running mean of the model by combining ‘a’ and ‘b’, or do the two devices update it independently of each other? I looked at the documentation and the source code but did not understand. (See the first sketch below.)
- I saw a comment where Soumith mentioned ‘if you notice the examples, DataParallel is not applied to the entire network + loss. It is only applied to part of the network.’ (How to use DataParallel in backward?). How does PyTorch’s nn.DataParallel module decide which part of the network to send to the GPUs? (See the second sketch below.)
- The documentation says ‘The parallelized module must have its parameters and buffers on device_ids[0] before running this DataParallel module.’ However, I was copying the model ‘net’ to the GPU device after applying nn.DataParallel to ‘net’ (as shown below), and the model trains fine. I do not understand why it is compulsory to send the model to the GPU before applying nn.DataParallel, as the documentation states. (See the third sketch below.)
import torch
import torch.nn as nn

net = nn.DataParallel(net)    # wrapped first ...
net = net.to(device)          # ... and moved to the device afterwards
if torch.cuda.is_available():
    net.cuda()                # move the wrapped model and criteria to the GPU
    softMax.cuda()
    CE_loss.cuda()
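
To make the questions concrete, here are a few small sketches. For the first question, this is how I would try to check whether the running statistics reflect the whole batch or only one device's chunk (a minimal sketch assuming two visible GPUs; the model is just a single BatchNorm layer so that the buffer is easy to inspect):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3).cuda()                  # single layer, easy to inspect
model = nn.DataParallel(bn, device_ids=[0, 1])

x = torch.randn(8, 3).cuda()                   # batch of 8 -> 4 samples per GPU
model.train()
model(x)

# running_mean starts at zeros and the default momentum is 0.1, so after one
# forward pass it should hold 0.1 * (whatever batch mean was actually used)
print(bn.running_mean / 0.1)                   # compare against the two candidates
print(x.mean(dim=0))                           # combined mean over all 8 samples
print(x[:4].mean(dim=0))                       # mean of GPU:0's chunk of 4 only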
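
For the second question, here is how I currently understand ‘applying DataParallel to only part of the network’ (a hypothetical toy model, loosely following the ImageNet example; all layer sizes are made up):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(),
        )
        # only this part of the network is parallelized; the classifier below
        # (and the loss computed outside the module) run on a single device
        self.features = nn.DataParallel(features)
        self.classifier = nn.Linear(16 * 32 * 32, 10)

    def forward(self, x):
        x = self.features(x)        # input scattered across GPUs, outputs gathered
        x = x.flatten(1)
        return self.classifier(x)   # runs on one device only

Is this roughly what is meant, i.e. we decide which submodule gets wrapped, rather than nn.DataParallel deciding by itself?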
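
And for the third question, this is the order the documentation seems to require, in contrast to what I did above (nn.Linear here is just a placeholder standing in for my actual model):

import torch
import torch.nn as nn

device = torch.device("cuda:0")

net = nn.Linear(10, 10)       # placeholder for my actual model
net = net.to(device)          # parameters/buffers on device_ids[0] *before* wrapping
net = nn.DataParallel(net, device_ids=[0, 1])

out = net(torch.randn(8, 10).to(device))   # batch of 8 -> 4 samples per GPU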