I have several questions about the usage of nn.DataParallel.

- What is the difference between the following two pipelines?
(1) create module -> load weights -> data parallel
(2) create module -> data parallel -> load weights
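For concreteness, here is a minimal CPU sketch of the two pipelines (nn.Linear stands in for create_some_module(), and I dropped .cuda() so it runs without GPUs):

```python
import torch
import torch.nn as nn

def create_some_module():
    return nn.Linear(4, 2)  # stand-in for the real model

# save some weights so both pipelines can load the same checkpoint
torch.save(create_some_module().state_dict(), "weights.pt")

# Pipeline (1): create module -> load weights -> data parallel
model1 = create_some_module()
model1.load_state_dict(torch.load("weights.pt"))
parallel1 = nn.DataParallel(model1)

# Pipeline (2): create module -> data parallel -> load weights
# (loading through .module, since the wrapped state_dict's keys
# would otherwise need a "module." prefix)
model2 = create_some_module()
parallel2 = nn.DataParallel(model2)
parallel2.module.load_state_dict(torch.load("weights.pt"))
```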
Will the second one fail to load the weights onto all the GPUs?
- In the following code, will the right model be saved?
model = create_some_module()
model_parallel = nn.DataParallel(model).cuda()
# do some training using model_parallel
torch.save(model.state_dict(), model_path)
In this code, I want to avoid saving model_parallel, because every param name in the saved state_dict would be prefixed with "module.", which is annoying when loading and testing the models.
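To illustrate the prefix issue, here is a minimal CPU sketch (nn.Linear stands in for create_some_module(), .cuda() omitted so it runs anywhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                  # stand-in for create_some_module()
model_parallel = nn.DataParallel(model)

# the wrapped module's state_dict carries the "module." prefix ...
assert all(k.startswith("module.") for k in model_parallel.state_dict())
# ... while the underlying module's does not:
assert set(model.state_dict()) == {"weight", "bias"}

# if a checkpoint was saved from model_parallel anyway, the prefix
# can be stripped before loading into a plain module:
ckpt = model_parallel.state_dict()
clean = {k[len("module."):]: v for k, v in ckpt.items()}
model.load_state_dict(clean)
```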
From the source code of nn.DataParallel, I found the following:
def forward(self, *inputs, **kwargs):
    if not self.device_ids:
        return self.module(*inputs, **kwargs)
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    if len(self.device_ids) == 1:
        return self.module(*inputs[0], **kwargs[0])
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
    outputs = self.parallel_apply(replicas, inputs, kwargs)
    return self.gather(outputs, self.output_device)
As far as I understand, every time model_parallel.forward() is called, the params in model are first copied to each GPU, and then the forward pass runs on each GPU. If that is right, then whatever I do to model after model_parallel = nn.DataParallel(model).cuda() will also be reflected in the replicas on all the GPUs. I wonder if this is right.
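A quick CPU experiment (nn.Linear standing in for the real model, no .cuda()) suggests that model_parallel keeps a reference to model rather than a copy, so in-place changes to model are what the per-forward replication picks up:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model_parallel = nn.DataParallel(model)

# DataParallel holds the original module itself, not a copy:
assert model_parallel.module is model

# so an in-place change to model is visible through the wrapper,
# and would be what gets replicated on the next forward() call:
with torch.no_grad():
    model.weight.zero_()
assert (model_parallel.module.weight == 0).all()
```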
- What is the difference between the following two ways of constructing an optimizer:
(1) with model: optimizer = torch.optim.SGD(model.parameters(), ...)
(2) with model_parallel: optimizer = torch.optim.SGD(model_parallel.parameters(), ...)
Will the first implementation perform the right optimization?
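From a quick check, both iterate over the very same parameter tensors, so I would expect either optimizer to step the same weights (minimal CPU sketch, nn.Linear standing in for the real model):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model_parallel = nn.DataParallel(model)

# both views expose identical parameter tensor objects:
assert {id(p) for p in model.parameters()} == \
       {id(p) for p in model_parallel.parameters()}

# so these two optimizers manage the same weight and bias tensors:
opt1 = torch.optim.SGD(model.parameters(), lr=0.1)
opt2 = torch.optim.SGD(model_parallel.parameters(), lr=0.1)
```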
Thank you all.