I am using DDP, and I am using BatchNorm in my network. If I do not set track_running_stats=False, I get the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)
I tried switching to SyncBatchNorm, but I still get the same error.
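For reference, this is roughly how I convert and wrap the model (a simplified sketch; MyModel and local_rank are placeholders for my actual setup):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Convert every BatchNorm layer to SyncBatchNorm before wrapping with DDP.
# MyModel and local_rank stand in for the real model and device index.
model = MyModel().cuda(local_rank)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])
```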
Could you post your model definition (and if possible an executable code snippet) so that we could have a look, please?
Okay, I have created a self-contained, runnable example:
https://pastebin.com/5QKKsfe6
https://pastebin.com/kD0fgPve
You need to download both files to run it.
Okay, I realized that if I remove the second call, output_right = model(input_right), the error goes away. How can I make SyncBatchNorm work with two inputs to the same network whose outputs I want to constrain?
Works:
output_left = model(input_left)
loss = torch.sum(output_left["output"][0] - 0)
Doesn't work:
output_left = model(input_left)
output_right = model(input_right)
loss = torch.sum(output_left["output"][0] - 0) + torch.sum(output_right["output"][0] - 0)
Does this mean I cannot call the model twice per iteration when using DDP? It looks like I have to rewrite my code so that input_left and input_right are passed through the model in a single forward call, as in the sketch below.
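A rough sketch of what that rewrite could look like, assuming model, input_left, and input_right are as above and both inputs have the same shape:

```python
import torch

# Concatenate both views along the batch dimension and run a single
# forward pass, so the model is only called once per iteration.
batch = torch.cat([input_left, input_right], dim=0)
output = model(batch)

# Split the combined output back into the left/right halves.
out = output["output"][0]
output_left, output_right = torch.chunk(out, 2, dim=0)

loss = torch.sum(output_left - 0) + torch.sum(output_right - 0)
loss.backward()
```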
Based on this comment it seems that broadcasting might be seen as an inplace operation, so you might avoid this error by using broadcast_buffers=False.
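Something like this when constructing the DDP wrapper (a rough sketch; local_rank and the device setup are placeholders for your actual configuration):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Disable buffer broadcasting at the start of each forward pass, so the
# BatchNorm buffers are not rewritten between the two forward calls.
model = DDP(
    model,
    device_ids=[local_rank],  # local_rank is a placeholder for your setup
    broadcast_buffers=False,
)
```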