I am looking to modify a pretrained ResNet to use a non-in-place version of BatchNorm2d, since the in-place behavior causes problems when I run in distributed mode (a RuntimeError saying that gradient computation is not possible because a variable was modified by an in-place operation):

`Error detected in CudnnBatchNormBackward`
I instantiate the model the vanilla way, no magic:
Is there a known solution to this problem?
Batchnorm layers don't have an `inplace` argument, so could you post an executable code snippet which reproduces this issue for further debugging, as well as the output of `python -m torch.utils.collect_env`?
Thank you @ptrblck. I have managed to solve this by setting `broadcast_buffers=False` in `DistributedDataParallel`. It turns out that with it set to `True`, DDP syncs the module buffers (such as the batchnorm running stats) with an in-place copy at each forward pass, which is what broke the gradient computation.
But now I have a massive memory leak issue, which I have described here: