DDP - SyncBatchNorm - Gradient Computation Modified?

I am using DDP, and I am using BatchNorm in my network. If I do not set track_running_stats=False, I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)

I tried to switch it to SyncBatchNorm, and I still get the same error.
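
For reference, the workaround I currently use is to set track_running_stats=False on every BatchNorm layer, roughly like this (a simplified sketch, my real model is larger):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 128, kernel_size=3, padding=1)
        # Disabling running stats avoids the error, but the layer then
        # always normalizes with batch statistics, including at eval time.
        self.bn = nn.BatchNorm2d(128, track_running_stats=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))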

Could you post your model definition (and if possible an executable code snippet) so that we could have a look, please?

Okay, I have created a self-contained, runnable example:

https://pastebin.com/5QKKsfe6

https://pastebin.com/kD0fgPve

You need to download both files to run it.

Okay, I realized that if I remove the second call, output_right = model(input_right), it no longer gives this error. How can I make SyncBatchNorm work for two inputs to the same network that I want to constrain?

Works

output_left = model(input_left)
loss = torch.sum(output_left["output"][0] - 0)

Doesn't work

output_left = model(input_left)
output_right = model(input_right)
loss = torch.sum(output_left["output"][0] - 0) + torch.sum(output_right["output"][0] - 0)

Does this mean I cannot call the model twice if I use DDP? Do I have to rewrite my code so that both input_left and input_right are passed into the model in a single forward call?
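
For what it's worth, a rough sketch of what that rewrite could look like (assuming both inputs have the same shape and the model returns the same output dict as above) is to concatenate them along the batch dimension and split the result afterwards:

# Run both views through the network in a single forward call.
batch = torch.cat([input_left, input_right], dim=0)
output = model(batch)
out_left, out_right = torch.chunk(output["output"][0], 2, dim=0)
loss = torch.sum(out_left - 0) + torch.sum(out_right - 0)

Note that the BatchNorm statistics are then computed over the combined batch rather than per input, which may or may not matter for the constraint.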

Based on this comment it seems that broadcasting might be seen as an inplace operation, so you might avoid this error by passing broadcast_buffers=False to DistributedDataParallel.
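
A minimal sketch of where that argument goes (local_rank here is just a placeholder for however your processes are set up):

from torch.nn.parallel import DistributedDataParallel as DDP

# broadcast_buffers=False stops DDP from broadcasting buffers (e.g. the
# BatchNorm running_mean / running_var) from rank 0 on every forward call,
# which is the inplace buffer update autograd is complaining about here.
model = DDP(
    model.to(local_rank),
    device_ids=[local_rank],
    broadcast_buffers=False,
)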
