How does nn.DataParallel guarantee that each replicated module ends up on its own GPU?

I’m trying to understand how nn.DataParallel uses multiple GPUs.
As far as I understand, whenever I call the forward function of a module wrapped in DataParallel, the following happens (a rough sketch of this flow appears right after the list):

  1. Split the inputs across the GPUs with the scatter function
  2. Replicate the original module onto each GPU
  3. Call forward on each replica with its corresponding (scattered) inputs
  4. Gather the outputs from the replicas and return them
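
For reference, here is a minimal sketch of that flow using the functional helpers from torch.nn.parallel (scatter, replicate, parallel_apply, gather). The model, device list, and batch are made up for illustration:

```python
import torch
from torch import nn
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

device_ids = [0, 1]                            # assumes two visible GPUs
model = nn.Linear(8, 4).cuda(device_ids[0])
batch = torch.randn(32, 8, device="cuda:0")

chunks = scatter(batch, device_ids)            # 1. split the batch across the GPUs
replicas = replicate(model, device_ids)        # 2. copy the module onto each GPU
inputs = [(x,) for x in chunks]                # one argument tuple per replica
outputs = parallel_apply(replicas, inputs)     # 3. run forward on each replica
result = gather(outputs, device_ids[0])        # 4. collect the outputs on the first GPU
```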

However, I couldn’t find where DataParallel guarantees that the sub-modules of each replica are placed on the right CUDA device.
In the case of the parameters and buffers of the original module, it’s guaranteed that they are copied to every usable GPU (https://github.com/pytorch/pytorch/blob/0988bbad2de5e0ce403c5e6f781437b24a484fc2/torch/nn/parallel/replicate.py#L12, https://github.com/pytorch/pytorch/blob/0988bbad2de5e0ce403c5e6f781437b24a484fc2/torch/nn/parallel/replicate.py#L19), but I can’t find similar code for the sub-modules of the original module.
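
To make the question concrete, this is the kind of check I have in mind (assuming at least two visible GPUs; replicate is the functional helper from torch.nn.parallel):

```python
import torch
from torch import nn
from torch.nn.parallel import replicate

# A model with nested sub-modules, placed on GPU 0.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2)).cuda(0)
replicas = replicate(model, [0, 1])

for i, rep in enumerate(replicas):
    # Every parameter and buffer of replica i (including those of its
    # sub-modules) should report device cuda:i.
    devices = {p.device for p in rep.parameters()} | {b.device for b in rep.buffers()}
    print(i, devices)
```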

Even though the parallel_apply function runs the forward function of each replica inside a torch.cuda.device(device of given inputs) context, I think that context only affects where newly created tensors are allocated, not where the replica itself lives.
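
A small experiment illustrating what I mean (the layer and tensor are just made-up examples):

```python
import torch
from torch import nn

lin = nn.Linear(4, 4).cuda(0)              # parameters live on cuda:0
with torch.cuda.device(1):
    t = torch.empty(2, 4, device="cuda")   # "cuda" with no index -> current device, i.e. cuda:1
    print(t.device)                        # cuda:1
    print(next(lin.parameters()).device)   # still cuda:0; the context did not move the module
```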

Could you tell me how each replica gets broadcast to its own GPU?

Moving a module to device x is the same as moving all of its parameters and buffers to that device. So the two lines mentioned in the question are sufficient to guarantee that each replica ends up on a different GPU.
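
In other words, an nn.Module has no device attribute of its own; a rough sketch of what “moving a module” amounts to (the module here is just an example):

```python
import torch
from torch import nn

m = nn.Linear(3, 3)
m.cuda(0)  # walks the module tree and moves every parameter and buffer to cuda:0
# The module "is on cuda:0" only in the sense that all of its tensors are:
print({p.device for p in m.parameters()})  # {device(type='cuda', index=0)}
```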

That’s not true. When we don’t specify devices while constructing the nn.DataParallel instance (which is the common case), the devices are set to a list of None and then end up as a list filled with the current device, the same one for all replicated modules.

But it doesn’t matter, since the i-th (module, inputs) pair is already located on GPU #i. So the result of the computation also ends up on the i-th GPU, which makes the with torch.cuda.device(device) context effectively a no-op here. And that’s why we need the gather after parallel_apply.
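
A quick check of that claim (the layer and input are made up; the point is that the computation follows the data, not the current-device context):

```python
import torch
from torch import nn

lin = nn.Linear(4, 4).cuda(1)              # replica i, with parameters on cuda:1
x = torch.randn(2, 4, device="cuda:1")     # scattered input chunk i, also on cuda:1
with torch.cuda.device(0):                 # "current device" is cuda:0 inside this block
    out = lin(x)                           # the matmul still runs on cuda:1, where the tensors live
print(out.device)                          # cuda:1 -> hence the gather step afterwards
```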

Please let me know if I’ve got anything wrong here.