Using spectral_norm with DistributedDataParallel makes backward() fail

Hi!

My training works if I use multiple GPUs on a single machine (i.e. with DataParallel). However, if I try to train on two machines (using DistributedDataParallel), I get the following error on .backward():

  one of the variables needed for gradient computation has been modified by
  an inplace operation: [torch.cuda.FloatTensor [256, 256, 5, 5]] is at version 2;
  expected version 1 instead. Hint: the backtrace further above shows the
  operation that failed to compute its gradient. The variable in question was
  changed in there or anywhere later. Good luck!

torch.autograd.set_detect_anomaly(True) points me to spectral_norm's code that updates the weight:

  File ".../python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__                                                                                                                  
    hook(self, input)                                                                                                                                                                                                                        
  File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 99, in __call__                                                                                                              
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))                                                                                                                                              
  File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight                                                                                                        
    weight = weight / sigma 

This definitely does not look like an in-place operation.
The same error occurs even if I use DistributedDataParallel on a single machine.

Any suggestions or ideas are more than welcome. Thanks in advance.

Versions

PyTorch: 1.1.0
CUDA: 9.0.176

Hey @bornabesic,

  1. Can you check whether setting broadcast_buffers=False in the DistributedDataParallel constructor works for you? (See the sketch after this list.)
  2. If not, can you try PyTorch v1.4?
  3. If it still does not work, could you please provide code for a minimal repro? Thanks!
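
For 1., something along these lines would do it (a minimal sketch only; the conv layer, the NCCL backend, and the local-rank mapping are placeholders for your own setup):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Minimal sketch -- assumes the process group is launched the usual way;
    # the model and the rank mapping are placeholders.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = dist.get_rank() % torch.cuda.device_count()  # normally provided by the launcher
    torch.cuda.set_device(local_rank)

    model = torch.nn.utils.spectral_norm(torch.nn.Conv2d(3, 8, 3)).cuda()
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        broadcast_buffers=False,  # skip broadcasting module buffers at every forward
    )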

@mrshenli

The problem happens regardless of the value of broadcast_buffers.

I managed to narrow down the source of the problem.
The error occurs only if I use multiple GPUs per machine AND run multiple forward passes of the module before backward().
Otherwise, it works just fine.
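
To illustrate what I mean, here is a stripped-down sketch of the pattern (not a verified standalone repro; the conv layer, tensor shapes, and loss are placeholders for my actual model):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Stripped-down sketch of the failing pattern: one process per machine,
    # DDP replicating over all local GPUs, two forward passes before backward().
    dist.init_process_group(backend="nccl", init_method="env://")

    net = torch.nn.utils.spectral_norm(torch.nn.Conv2d(3, 3, 3, padding=1)).cuda()
    ddp_net = DDP(net, device_ids=list(range(torch.cuda.device_count())))

    x = torch.randn(4, 3, 16, 16, device="cuda")
    out1 = ddp_net(x)                       # first forward pass
    out2 = ddp_net(out1)                    # second forward pass, no backward in between
    (out1.mean() + out2.mean()).backward()  # the in-place error is raised here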

multiple forward passes of the module before backward().

Do you mean running multiple forward passes on the same DDP instance before launching the backward pass? If so, hitting an error is expected due to prepare_for_backward, though I would have expected a different error message. A workaround is to wrap the multiple forward passes into a single YourModule.forward function and then use DDP to wrap YourModule, roughly as sketched below.
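
Something like this (a sketch; TwoPassWrapper and the two-pass usage are placeholders for your actual module and training loop):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Sketch of the workaround: keep all forward passes inside a single forward(),
    # so DDP sees exactly one forward call per backward.
    class TwoPassWrapper(torch.nn.Module):
        def __init__(self, module):
            super().__init__()
            self.module = module

        def forward(self, x):
            out1 = self.module(x)     # first pass
            out2 = self.module(out1)  # second pass, still inside one DDP forward
            return out1, out2

    # Usage (placeholders):
    # ddp_model = DDP(TwoPassWrapper(your_module), device_ids=[local_rank])
    # out1, out2 = ddp_model(x)
    # (out1.mean() + out2.mean()).backward()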