Backward error caused by batch norm in PyTorch 1.5.1

Source code: https://github.com/agrimgupta92/sgan
When I set batch_norm to true, the backward pass fails with the error below.
Warning: Error detected in CudnnBatchNormBackward. Traceback of forward call that caused the error:
  File "train_dist.py", line 644, in <module>
    main(args)
  File "train_dist.py", line 282, in main
    optimizer_d)
  File "train_dist.py", line 433, in discriminator_step
    scores_fake = discriminator(traj_fake, traj_fake_rel, seq_start_end)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 458, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caros/sgan-master/sgan/models.py", line 619, in forward
    scores = self.real_classifier(classifier_input)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 106, in forward
    exponential_average_factor, self.eps)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1923, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Traceback (most recent call last):
  File "train_dist.py", line 644, in <module>
    main(args)
  File "train_dist.py", line 282, in main
    optimizer_d)
  File "train_dist.py", line 451, in discriminator_step
    data_loss.backward()
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/caros/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 102, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
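For context (not specific to this model): this RuntimeError means that a tensor autograd saved during the forward pass was modified in place before backward() ran, so its version counter no longer matches. A minimal, self-contained sketch of the same error class, unrelated to SGAN or batch norm, just to show the mechanism:

```python
import torch

x = torch.randn(4, requires_grad=True)

# exp() saves its output tensor for the backward pass
y = x.exp()

# In-place modification bumps y's version counter, invalidating the saved tensor
y.add_(1)

# Raises: RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
y.sum().backward()
```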

Hi @Red_Lv, is your problem solved? I am running into the same situation. I would really appreciate it if you could share how you solved it. :slight_smile:

I encountered the same error when testing with DistributedDataParallel and convert_sync_batchnorm; it happened when running on a single GPU on my local machine. I found that using more GPUs resolved the error.
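For anyone landing here: two workarounds commonly suggested for this error with BatchNorm under DistributedDataParallel are converting BatchNorm layers to SyncBatchNorm (as the post above did) and disabling DDP's buffer broadcasting, since DDP with broadcast_buffers=True rewrites BatchNorm buffers (running_mean/running_var) in place at every forward call. Whether that is the actual cause in SGAN is not confirmed by this thread. A rough sketch, assuming a process group is already initialized; the toy model and argument values are illustrative only:

```python
import torch
import torch.nn as nn

# Assumes torch.distributed.init_process_group(...) has already been called
# and that `model` is a module containing BatchNorm layers (e.g. the SGAN
# discriminator); this toy Sequential stands in for it.
model = nn.Sequential(nn.Linear(16, 16), nn.BatchNorm1d(16), nn.ReLU()).cuda()

# Option 1: replace BatchNorm layers with SyncBatchNorm (what the post above used).
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Option 2 (sometimes combined with the above): stop DDP from re-broadcasting
# buffers in place on every forward call.
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[0],          # illustrative; use the local rank in practice
    broadcast_buffers=False,
)
```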