ParameterList assigned to 1 GPU only (?)

Hey folks,

I am new to PyTorch and I am trying to parallelize my network. Using nn.DataParallel seems to work as expected for the nn.Module members of my class; however, when I print out the module's parameters, the nn.ParameterLists I define as class members are all listed as sitting on GPU 0 only.

Is this expected behaviour and why are they not listed on both of the GPUs I’m using? Could somebody please explain what is going on here?


  • torch.cuda.device_count() returns 2 as expected.

My code looks something like the following:

class Network(nn.Module):
    def __init__(self):
        ...
        self.templates = nn.ModuleList([
            nn.ParameterList([
                nn.Parameter(template_init, requires_grad=True) for i in range(n)
            ])
            for n in self.num_t
        ])

...

self.Network = nn.DataParallel(self.Network)
self.Network.to(self.device)
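
For reference, the check I'm doing is roughly along these lines (simplified; the loop is just an illustrative way of printing where each parameter lives):

# Print which device each parameter currently lives on (illustrative sketch).
for name, param in self.Network.named_parameters():
    print(name, param.device)   # every entry here shows cuda:0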

Hi @ortho-stice

This is expected behavior. Here is the source code of DataParallel: https://github.com/pytorch/pytorch/blob/46539eee0363e25ce5eb408c85cefd808cd6f878/torch/nn/parallel/data_parallel.py#L148-L153

What happens is that, on every forward pass, DataParallel will:

  1. scatter the input across all GPUs,
  2. replicate the model on all GPUs,
  3. launch parallel_apply, so that every GPU runs its own forward pass on its own split of the input data, in parallel, and
  4. gather all the outputs back to the output device.

So the model replication only happens inside the forward pass, which is why you won't see those replicas outside the forward function; the parameters you print outside forward belong to the single master copy sitting on GPU 0.
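
For intuition, the same four steps can be written out by hand with the primitives DataParallel uses internally (a minimal sketch; the device ids, the model placement, and the batch shape are illustrative assumptions):

import torch
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1]
model = Network().to('cuda:0')                        # plain nn.Module, not wrapped; master copy on device_ids[0]
batch = torch.randn(8, 3, 32, 32, device='cuda:0')    # illustrative input

inputs = scatter(batch, device_ids)                    # 1. split the batch across GPUs
replicas = replicate(model, device_ids[:len(inputs)])  # 2. copy the model (ParameterLists included) onto each GPU
outputs = parallel_apply(replicas, inputs)             # 3. each replica runs forward on its chunk in parallel
result = gather(outputs, device_ids[0])                # 4. collect the outputs back on the output device

The replicas (and the copies of your ParameterLists inside them) are created fresh on every such call and discarded afterwards, so they never show up when you inspect the top-level module.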

BTW, we do recommend using DistributedDataParallel, which replicates the model only once, in its constructor, instead of on every forward invocation.
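
A minimal single-node sketch of that setup, assuming one process per GPU (Network, the port number, and the training loop are placeholders for your own code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; rank identifies this process.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = Network().to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # model state is broadcast once, here

    # ... training loop: each process consumes its own shard of the data ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)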