Hi, I’m trying to implement a customized DataParallel where I manually reduce gradients from the replicas on multiple GPUs. The reason I’m doing this is that I want each GPU to accumulate gradients locally for several iterations before doing a single gradient-reduce operation across the GPUs, which should cut the communication overhead. With 2 GPUs it seems to work. However, with 4 GPUs, after some iterations the program simply crashes without any error message; I suspect some lower-level C code is crashing. Do you have any idea why? Here is my code:
```python
import torch.cuda.comm as comm
from torch.nn.parallel import DataParallel, replicate


class DataParallelAccumulation(DataParallel):
    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super().__init__(module, device_ids=device_ids,
                         output_device=output_device, dim=dim)
        if len(self.device_ids) > 1:
            # Keep the replicas alive across iterations so each GPU
            # accumulates gradients locally.
            self.replicas = self.replicate(self.module, self.device_ids, detach=True)

    def forward(self, *inputs, **kwargs):
        if not self.device_ids:
            return self.module(*inputs, **kwargs)
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])
        # Reuse the cached replicas instead of re-replicating every forward pass.
        outputs = self.parallel_apply(self.replicas[:len(inputs)], inputs, kwargs)
        return outputs

    def reduce_grads(self):
        if len(self.device_ids) > 1:
            for parameters in zip(self.module.parameters(),
                                  *[r.parameters() for r in self.replicas]):
                destination_device = parameters[0].get_device()
                # Sum the accumulated gradients from all replicas onto the
                # master parameter's device.
                parameters[0].grad = comm.reduce_add(
                    [p.grad for p in parameters[1:]],
                    destination=destination_device)

    def synchronize(self):
        if len(self.device_ids) > 1:
            # Refresh the replicas from the (just-updated) master module.
            self.replicas = self.replicate(self.module, self.device_ids, detach=True)

    def replicate(self, module, device_ids, detach=False):
        return replicate(module, device_ids, detach=detach)
```
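For context, here is roughly how I drive it in the training loop (a minimal sketch; `net`, `loader`, `criterion`, and `ACC_STEPS` are placeholders I've made up for this post):

```python
import torch
from torch.nn.parallel import scatter

net = MyNet().cuda()                      # placeholder model
model = DataParallelAccumulation(net, device_ids=[0, 1, 2, 3])
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
ACC_STEPS = 4                             # local accumulation steps per reduce

for step, (x, y) in enumerate(loader):
    outputs = model(x.cuda())             # list of per-device outputs (not gathered)
    targets = scatter(y.cuda(), model.device_ids)
    for out, tgt in zip(outputs, targets):
        # Each backward accumulates gradients on that replica's own device.
        criterion(out, tgt).backward()
    if (step + 1) % ACC_STEPS == 0:
        model.reduce_grads()              # single cross-GPU reduce
        optimizer.step()
        optimizer.zero_grad()
        model.synchronize()               # re-replicate the updated master module
```

The idea is that `backward()` only touches the local replica's gradients, so the cross-GPU traffic happens once every `ACC_STEPS` steps rather than every iteration.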