The usage of nn.DataParallel

I have several questions about the usage of nn.DataParallel.

  1. What are the differences between the following two pipelines:
    (1) create module -> load weights -> data parallel
    (2) create module -> data parallel -> load weights
    Will the second one fail to load the weights to all the GPUs?
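To make the difference concrete, here is a minimal sketch (using a toy `nn.Linear` module for illustration) showing that wrapping with nn.DataParallel prefixes every parameter key with "module.", so the order of wrapping vs. loading determines which key names a checkpoint must use:

```python
import torch
import torch.nn as nn

# Toy module standing in for create_some_module().
model = nn.Linear(4, 2)
print(list(model.state_dict().keys()))    # ['weight', 'bias']

wrapped = nn.DataParallel(model)
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']

# Pipeline (1): load into the bare module, then wrap -- plain keys match.
model.load_state_dict(model.state_dict())

# Pipeline (2): load into the wrapped module -- keys need the "module." prefix.
wrapped.load_state_dict(wrapped.state_dict())
```

So neither pipeline fails as such; the second one simply expects checkpoint keys carrying the "module." prefix.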

  2. In the following code, will the right model be saved?

model = create_some_module()
model_parallel = nn.DataParallel(model).cuda()
# do some training using model_parallel
torch.save(model.state_dict(), model_path)

In this code, I save model.state_dict() rather than model_parallel.state_dict() because in the latter every parameter name is prefixed with “module.”, which is annoying when loading the weights for testing.
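For completeness, here is a hedged sketch of stripping that prefix when all you have is a checkpoint saved from the wrapped model (the checkpoint dict below is built inline for illustration rather than loaded from disk):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Pretend this checkpoint was saved from a DataParallel-wrapped model,
# so every key carries the "module." prefix.
checkpoint = {"module." + k: v for k, v in model.state_dict().items()}

# Strip the prefix before loading into the bare (unwrapped) module.
stripped = {k[len("module."):] if k.startswith("module.") else k: v
            for k, v in checkpoint.items()}
model.load_state_dict(stripped)
```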

From the source code of nn.DataParallel, I found the following code:

def forward(self, *inputs, **kwargs):
    if not self.device_ids:
        return self.module(*inputs, **kwargs)
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    if len(self.device_ids) == 1:
        return self.module(*inputs[0], **kwargs[0])
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
    outputs = self.parallel_apply(replicas, inputs, kwargs)
    return self.gather(outputs, self.output_device)

As far as I understand, every time model_parallel.forward() is called, the parameters of model are first replicated onto each GPU, and then the forward pass runs on each replica. If that is right, then anything I do to model after model_parallel = nn.DataParallel(model).cuda() will be reflected in the replicas on all GPUs, since the replicas are re-created from model on every forward call. I wonder if this is right.
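A small check (a sketch using a toy module, with `.cuda()` omitted so it also runs on CPU) suggesting this is indeed the case: nn.DataParallel holds a reference to the original module, not a copy, so mutations to model are visible through model_parallel and feed into the replicas built on the next forward():

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model_parallel = nn.DataParallel(model)  # .cuda() omitted for a CPU-only sketch

# DataParallel stores a reference to the original module, not a copy:
print(model_parallel.module is model)  # True

# So mutating model's parameters is visible through model_parallel,
# and the per-GPU replicas are rebuilt from these tensors on every forward().
with torch.no_grad():
    model.weight.zero_()
print(model_parallel.module.weight.abs().sum().item())  # 0.0
```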

  3. What is the difference between the following two constructions:
    (1) Construct an optimizer with model: optimizer = torch.optim.SGD(model.parameters(), ...)
    (2) Construct an optimizer with model_parallel: optimizer = torch.optim.SGD(model_parallel.parameters(), ...)
    Will the first one perform the right optimization?
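A quick check (a sketch on CPU with a toy module) suggesting why the two constructions should behave the same: model_parallel.parameters() just yields the parameters of the wrapped model, so an optimizer built from either iterates over the very same tensors:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model_parallel = nn.DataParallel(model)

params_direct = list(model.parameters())
params_wrapped = list(model_parallel.parameters())

# Both iterators yield the very same Parameter objects,
# so an optimizer built from either updates the same storage.
same = (len(params_direct) == len(params_wrapped)
        and all(a is b for a, b in zip(params_direct, params_wrapped)))
print(same)  # True
```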

Thank you all.