Using spectral_norm with DistributedDataParallel makes backward() fail

Hi!

My training works if I use multiple GPUs on a single machine (i.e. with DataParallel). However, if I try to train on two machines (using DistributedDataParallel), I get the following error on .backward():

  one of the variables needed for gradient computation has been modified by
  an inplace operation: [torch.cuda.FloatTensor [256, 256, 5, 5]] is at version 2;
  expected version 1 instead. Hint: the backtrace further above shows the
  operation that failed to compute its gradient. The variable in question was
  changed in there or anywhere later. Good luck!

torch.autograd.set_detect_anomaly(True) points me to spectral_norm's code that updates the weight:

  File ".../python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__                                                                                                                  
    hook(self, input)                                                                                                                                                                                                                        
  File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 99, in __call__                                                                                                              
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))                                                                                                                                              
  File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight                                                                                                        
    weight = weight / sigma 

This definitely does not look like an in-place operation.
The same error occurs even if I use DistributedDataParallel on a single machine.

Any suggestions or ideas are more than welcome. Thanks in advance.

Versions

PyTorch: 1.1.0
CUDA: 9.0.176

Hey @bornabesic,

  1. Can you check whether setting broadcast_buffers=False in the DistributedDataParallel constructor works for you? (See the sketch after this list.)
  2. If not, can you try PyTorch v1.4?
  3. If it still does not work, could you please provide code for a minimal repro? Thanks!
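
For 1., something along these lines would do it (a minimal sketch only; the conv layer, the NCCL backend, and the local-rank mapping are placeholders for your own setup):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Minimal sketch -- assumes the process group is launched the usual way;
    # the model and the rank mapping are placeholders.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = dist.get_rank() % torch.cuda.device_count()  # normally provided by the launcher
    torch.cuda.set_device(local_rank)

    model = torch.nn.utils.spectral_norm(torch.nn.Conv2d(3, 8, 3)).cuda()
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        broadcast_buffers=False,  # skip broadcasting module buffers at every forward
    )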

@mrshenli

The problem happens regardless of the value of broadcast_buffers.

I managed to narrow down the source of the problem.
The error occurs only if I use multiple GPUs per machine AND run multiple forward passes of the module before backward().
Otherwise, it works just fine.
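
To illustrate what I mean, here is a stripped-down sketch of the pattern (not a verified standalone repro; the conv layer, tensor shapes, and loss are placeholders for my actual model):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Stripped-down sketch of the failing pattern: one process per machine,
    # DDP replicating over all local GPUs, two forward passes before backward().
    dist.init_process_group(backend="nccl", init_method="env://")

    net = torch.nn.utils.spectral_norm(torch.nn.Conv2d(3, 3, 3, padding=1)).cuda()
    ddp_net = DDP(net, device_ids=list(range(torch.cuda.device_count())))

    x = torch.randn(4, 3, 16, 16, device="cuda")
    out1 = ddp_net(x)                       # first forward pass
    out2 = ddp_net(out1)                    # second forward pass, no backward in between
    (out1.mean() + out2.mean()).backward()  # the in-place error is raised here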

multiple forward passes of the module before backward().

Do you mean running multiple forward passes on the same DDP instance before launching the backward pass? If so, hitting an error is expected due to prepare_for_backward, though I would have expected a different error message. A workaround is to wrap the multiple forward passes into a single YourModule.forward function and then use DDP to wrap YourModule, roughly as sketched below.
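
Something like this (a sketch; TwoPassWrapper and the two-pass usage are placeholders for your actual module and training loop):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Sketch of the workaround: keep all forward passes inside a single forward(),
    # so DDP sees exactly one forward call per backward.
    class TwoPassWrapper(torch.nn.Module):
        def __init__(self, module):
            super().__init__()
            self.module = module

        def forward(self, x):
            out1 = self.module(x)     # first pass
            out2 = self.module(out1)  # second pass, still inside one DDP forward
            return out1, out2

    # Usage (placeholders):
    # ddp_model = DDP(TwoPassWrapper(your_module), device_ids=[local_rank])
    # out1, out2 = ddp_model(x)
    # (out1.mean() + out2.mean()).backward()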