When using DistributedDataParallel (DDP) to train a model with batch normalization, you may encounter the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
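For reference, the error can show up in a setup along the following lines; the model, tensor shapes, and process-group initialization below are illustrative placeholders rather than the actual training code:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # BatchNorm keeps running_mean/running_var buffers that are updated
    # in place on every forward pass while the layer is in train() mode.
    model = nn.Sequential(
        nn.Conv2d(3, 128, 3, padding=1),
        nn.BatchNorm2d(128),
        nn.ReLU(),
    ).cuda(rank)

    ddp_model = DDP(model, device_ids=[rank])  # broadcast_buffers defaults to True

    x = torch.randn(8, 3, 32, 32, device=rank)
    loss = ddp_model(x).mean()
    loss.backward()  # the in-place/version-counter error can surface here
```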
Batch normalization seems to update its running statistics (buffers) in place during the forward pass, which can cause conflicts when using DDP. To avoid these conflicts, it is recommended to use SyncBatchNorm instead of plain batch normalization when using DDP. You can use nn.SyncBatchNorm.convert_sync_batchnorm to convert the batch normalization layers to SyncBatchNorm layers.
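A minimal sketch of that conversion step, assuming an example model (the conversion should be done before wrapping the model in DDP):

```python
import torch.nn as nn

# Placeholder model containing BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 128, 3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

# Recursively replaces every nn.BatchNorm*d layer with nn.SyncBatchNorm;
# the default process group is used unless one is passed explicitly.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```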
However, even after converting the batch normalization layers, you may encounter the same error if the model is in eval() mode. In this case, you need to set broadcast_buffers=False when wrapping the model in DistributedDataParallel to avoid the synchronization errors.
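For example, something along these lines (rank and device setup are assumed to be handled elsewhere):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(rank),
    device_ids=[rank],
    broadcast_buffers=False,  # skip broadcasting buffers (e.g. BN running stats) at each forward
)
```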
Is it possible to train a model with batch normalization using DDP while the model is in eval() mode and broadcast_buffers=True?
(I referred to this post for guidance.)