Module.state_dict() is wrong when using DataParallel

LuChengTHU · July 30, 2020, 3:08pm

I have a module like this:

import copy

import torch.nn as nn


class Block(nn.Module):
    def __init__(self, net):
        super(Block, self).__init__()
        self.net = net
        self.net_copy = copy.deepcopy(net)

    def forward(self, x):
        # Sync net_copy with net on every forward pass
        self.net_copy.load_state_dict(self.net.state_dict())
        return self.net(x)

The net is an nn.Sequential() module. When I use PyTorch>=1.5 with nn.DataParallel on multiple GPUs, net_copy.state_dict().keys() differs from net.state_dict().keys(). However, with PyTorch==1.4 or a single GPU, this problem doesn’t appear. How can I make sure that net and net_copy are exactly the same?

mrshenli · July 30, 2020, 4:18pm

This is probably due to this PR: https://github.com/pytorch/pytorch/pull/33907

In v1.5, parameters on replicated models are no longer considered leaves, as they shouldn’t be. If you really need to access those replicated parameters, you can probably get them from _former_parameters and manually add them to the state_dict.


pytorch/pytorch/blob/c93e96fbd9903e576c6c1aa2fe12d8d548ae2d5b/torch/nn/parallel/replicate.py#L148


            replica._parameters[key] = None
    else:
        param_idx = param_indices[param]
        for j in range(num_replicas):
            replica = module_copies[j][i]
            param = param_copies[j][param_idx]
            # parameters in replicas are no longer leaves,
            # so setattr them as non-parameter attributes
            setattr(replica, key, param)
            # expose the parameter for DDP
            replica._former_parameters[key] = param
for key, buf in module._buffers.items():
    if buf is None:
        for j in range(num_replicas):
            replica = module_copies[j][i]
            replica._buffers[key] = None
    else:
        if buf.requires_grad and not detach:
            buffer_copies = buffer_copies_rg
            buffer_idx = buffer_indices_rg[buf]
        else:

cc @ngimel please correct me if I am wrong. Any thoughts on whether we should make state_dict() consistent between v1.4 and v1.5?
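
The effect described above can be imitated on a single CPU, without DataParallel. The snippet below is only a sketch of the mechanism, not the actual replicate code: it assumes that removing a parameter from a layer and re-attaching a non-leaf copy as a plain attribute is a fair stand-in for what happens to a replica. The names layer and original_weight are made up for the illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
keys_before = sorted(layer.state_dict().keys())

# Imitate what a replica looks like: the registered parameter is removed,
# and a non-leaf tensor is attached under the same name as a plain attribute.
original_weight = layer.weight
del layer.weight                          # drops 'weight' from layer._parameters
layer.weight = original_weight.clone()    # clone() of a parameter is not a leaf

keys_after = sorted(layer.state_dict().keys())
output = layer(torch.randn(1, 3))         # forward still works via self.weight

print(keys_before)            # ['bias', 'weight']
print(keys_after)             # ['bias']
print(layer.weight.is_leaf)   # False
```

The forward pass is unaffected because it only reads self.weight, but state_dict() walks the registered parameters and therefore no longer sees the weight.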

aashaka · October 17, 2020, 1:31pm

In order to access _former_parameters, we would need to access the replica, right? Can you help me figure out how to access _former_parameters in the OP’s example?
Or how to recreate the state dict in some other manner?

mrshenli · October 18, 2020, 3:48pm

Hey @aashaka

Below is the implementation of the DataParallel.forward method. It basically calls replicas[i].forward(inputs[i], ...). So during execution, the self variable in the forward function is the replica. Hence, you can use self._former_parameters to access the field in the forward function.


pytorch/pytorch/blob/c3466dabaae9328b207804afb043b7b519f64825/torch/nn/parallel/data_parallel.py#L147-L162


def forward(self, *inputs, **kwargs):
    if not self.device_ids:
        return self.module(*inputs, **kwargs)

    for t in chain(self.module.parameters(), self.module.buffers()):
        if t.device != self.src_device_obj:
            raise RuntimeError("module must have its parameters and buffers "
                               "on device {} (device_ids[0]) but found one of "
                               "them on device: {}".format(self.src_device_obj, t.device))

    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    if len(self.device_ids) == 1:
        return self.module(*inputs[0], **kwargs[0])
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
    outputs = self.parallel_apply(replicas, inputs, kwargs)
    return self.gather(outputs, self.output_device)

aashaka · October 19, 2020, 1:48pm

I managed to recreate the state_dict using code similar to state_dict. Thanks for your help.

I noticed that _former_parameters exists in 1.5.1 but not in 1.5.0. It seems tricky to get the parameters in 1.5.0 if we do not know the names of the parameters in advance (but still possible since we are setting attr). Any suggestions for this?
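
One possible way to find the parameters without knowing their names, sketched here under assumptions: since the replica code sets the parameter copies with setattr, they should show up as plain tensor attributes in the module's instance dictionary. The helper find_plain_tensor_attrs is a hypothetical name, and the replica is only imitated on CPU by detaching a parameter and re-attaching a non-leaf copy, so this has not been checked against a real 1.5.0 replica.

```python
import torch
import torch.nn as nn


def find_plain_tensor_attrs(module):
    """Return tensors stored as ordinary attributes (hypothetical helper).

    Registered parameters and buffers live in module._parameters and
    module._buffers, so they do not appear here; only tensors set as
    plain attributes are found.
    """
    return {
        name: value
        for name, value in vars(module).items()
        if isinstance(value, torch.Tensor)
    }


# Imitate a replica on CPU (assumption: a real replica looks similar).
layer = nn.Linear(3, 2)
weight = layer.weight
del layer.weight               # unregister the parameter
layer.weight = weight.clone()  # re-attach a non-leaf copy as a plain attribute

found = find_plain_tensor_attrs(layer)
print(list(found.keys()))      # ['weight']
```

Applied recursively over _modules with a name prefix, the same lookup could stand in for _former_parameters when rebuilding a state dict.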

mrshenli · October 19, 2020, 2:29pm

Hey @aashaka, yep, we added _former_parameters after v1.5.0 to fix the regression caused by https://github.com/pytorch/pytorch/pull/33907.

If this has become very inconvenient for you, I would suggest switching to DistributedDataParallel. There are more discussions here: https://github.com/pytorch/pytorch/issues/36268

aashaka · October 20, 2020, 2:50am

Thanks a lot. I have one last question. Like the OP, I need to recreate the state dict on every forward pass. I see about an 8x increase in training time compared to the original PyTorch DataParallel. Any ideas why this might be the case?

from collections import OrderedDict

import torch


def create_state_dict_new(main_module):
    state_dict_data = OrderedDict()

    def state_dict_recursion(this_module, state_dict_data, prefix=''):
        # In a replica, parameters are kept in _former_parameters
        if hasattr(this_module, "_former_parameters"):
            for name, param in this_module._former_parameters.items():
                if param is not None:
                    state_dict_data[prefix + name] = param
        for name, buf in this_module._buffers.items():
            if buf is not None:
                state_dict_data[prefix + name] = buf
        for name, module in this_module._modules.items():
            if module is not None:
                state_dict_recursion(module, state_dict_data, prefix + name + '.')
    state_dict_recursion(main_module._modules['model'], state_dict_data)
    return state_dict_data

class ModelWrapper(torch.nn.Module):
    def __init__(self, model):
        super(ModelWrapper, self).__init__()
        self.model = model
    def forward(self, x):
        state_list = create_state_dict_new(self)
        # Call the wrapped submodule of this replica, not the global model
        return self.model(x)

model = torch.nn.DataParallel(ModelWrapper(model))

mrshenli · October 20, 2020, 3:44pm

Could you please measure the time spent on the create_state_dict_new?

The forward function is launched in a separate thread for each replica. If you have 4 GPUs, there will be 4 threads executing create_state_dict_new independently. However, due to the Python GIL, the 4 threads cannot run the function concurrently, which further exacerbates the delay.
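
A rough way to take that measurement, and to see how threads behave for pure-Python work, is sketched below. It does not use PyTorch at all: busy_python_work is a made-up stand-in for a Python-level loop such as create_state_dict_new, and the thread count of 4 simply mirrors the 4-GPU case. On a standard CPython build the threaded run is expected to take roughly as long as running the function 4 times in sequence, but the exact numbers depend on the machine, so none are quoted here.

```python
import threading
import time


def busy_python_work(n):
    """Stand-in for a pure-Python loop (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i
    return total


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_in_threads(num_threads, n):
    """Run busy_python_work in num_threads threads and collect the results."""
    results = [None] * num_threads

    def worker(index):
        results[index] = busy_python_work(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


single_result, single_time = timed(busy_python_work, 200_000)
threaded_results, threaded_time = timed(run_in_threads, 4, 200_000)

print(f"one call:           {single_time:.4f} s")
print(f"4 calls, 4 threads: {threaded_time:.4f} s")
```

Wrapping the call to create_state_dict_new inside forward with the same time.perf_counter() pattern would show how much of each iteration it accounts for.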