When using DDP, if the network contains a batch normalization (BN) layer and only a single GPU is used, an error occurs when two consecutive forward passes are run before backward propagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
However, this issue does not arise when using multiple GPUs. Why is this the case? nn.SyncBatchNorm.convert_sync_batchnorm() has already been applied to convert the BN layers to synchronized batch normalization.
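For context, here is a minimal sketch of the pattern being described; the toy model, tensor shapes, and the torchrun/NCCL setup are my assumptions, not taken from the actual training script:

```python
# Minimal sketch, assuming a toy model and a torchrun launch (not the original script).
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy network with a BN layer; BatchNorm2d(3) has size-3 parameters,
    # which would match the FloatTensor [3] mentioned in the error message.
    model = nn.Sequential(
        nn.Conv2d(3, 3, kernel_size=3, padding=1),
        nn.BatchNorm2d(3),
        nn.ReLU(),
    ).cuda(local_rank)
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(4, 3, 8, 8, device=local_rank)
    out1 = model(x)  # first forward pass
    out2 = model(x)  # second forward pass, before any backward
    loss = out1.sum() + out2.sum()
    loss.backward()  # reportedly raises the in-place RuntimeError on a single GPU

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```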
I would assume DDP detects the single GPU and executes a single-GPU run. However, I also don't understand your use case for applying DDP on a single device.
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3]] is at version 3; expected version 2 instead.
If I run `CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node 2 train.py`, there is no error.
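(For contrast, the failing single-GPU case would presumably correspond to a launch like `CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node 1 train.py`; the GPU index here is an assumption, as the original launch command was not shown.)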