I’m now trying to understand how nn.DataParallel uses multiple GPUs.
As far as I understand, whenever I call the forward function of a module wrapped in DataParallel, it:
- Splits the inputs across the GPUs with the scatter function
- Replicates the original module onto each GPU
- Calls forward on each replica with its corresponding (split) inputs
- Gathers the outputs from all replicas and returns them
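To make my mental model concrete, here is a toy, pure-Python sketch of those four steps. The function names mirror `torch.nn.parallel.scatter` / `replicate` / `parallel_apply` / `gather`, but these are plain stand-ins that only illustrate the control flow, not actual device placement or parameter broadcasting:

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def scatter(batch, n_devices):
    """Split the input batch into n_devices roughly equal chunks
    (stand-in for scattering tensors across GPUs)."""
    chunk = (len(batch) + n_devices - 1) // n_devices
    return [batch[i * chunk:(i + 1) * chunk] for i in range(n_devices)]

def replicate(module, n_devices):
    """Make one copy of the module per device (a deep copy here, as a
    stand-in for broadcasting parameters/buffers to each GPU)."""
    return [copy.deepcopy(module) for _ in range(n_devices)]

def parallel_apply(replicas, inputs):
    """Run each replica on its own input chunk, one thread per replica,
    preserving the order of the chunks."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        return list(pool.map(lambda pair: pair[0](pair[1]),
                             zip(replicas, inputs)))

def gather(outputs):
    """Concatenate the per-replica outputs back into one result."""
    return [y for out in outputs for y in out]

def data_parallel_forward(module, batch, n_devices=2):
    """The four steps above, in order."""
    inputs = scatter(batch, n_devices)
    replicas = replicate(module, n_devices)
    outputs = parallel_apply(replicas, inputs)
    return gather(outputs)
```

For example, `data_parallel_forward(lambda xs: [2 * x for x in xs], [1, 2, 3, 4], n_devices=2)` returns `[2, 4, 6, 8]`. My question below is about what happens inside the real `replicate` step.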
However, I couldn’t find how DataParallel guarantees that the sub-modules of each replica lie on a particular CUDA device.
For the parameters and buffers of the original module, it’s guaranteed that they are copied to every usable GPU (https://github.com/pytorch/pytorch/blob/0988bbad2de5e0ce403c5e6f781437b24a484fc2/torch/nn/parallel/replicate.py#L12, https://github.com/pytorch/pytorch/blob/0988bbad2de5e0ce403c5e6f781437b24a484fc2/torch/nn/parallel/replicate.py#L19), but I can’t find equivalent code for the sub-modules of the original module.
Even though the parallel_apply function runs each replica’s forward inside a torch.cuda.device context (set to the device of that replica’s inputs), I think that only affects where newly created tensors are allocated, not the replica itself.
Could you tell me how each replica is broadcast onto its GPU?