When using DistributedDataParallel (DDP) to train a model with batch normalization, you may encounter the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
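For reference, the error can show up in a setup along the following lines; the model, tensor shapes, and process-group initialization below are illustrative placeholders rather than the actual training code:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # BatchNorm keeps running_mean/running_var buffers that are updated
    # in place on every forward pass while the layer is in train() mode.
    model = nn.Sequential(
        nn.Conv2d(3, 128, 3, padding=1),
        nn.BatchNorm2d(128),
        nn.ReLU(),
    ).cuda(rank)

    ddp_model = DDP(model, device_ids=[rank])  # broadcast_buffers defaults to True

    x = torch.randn(8, 3, 32, 32, device=rank)
    loss = ddp_model(x).mean()
    loss.backward()  # the in-place/version-counter error can surface here
```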
Batch normalization seems to update its running statistics (buffers) in place during the forward pass, which can cause conflicts when using DDP. To avoid these conflicts, it is recommended to use SyncBatchNorm instead of plain batch normalization when using DDP. You can use nn.SyncBatchNorm.convert_sync_batchnorm to convert the batch normalization layers to SyncBatchNorm layers.
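A minimal sketch of that conversion step, assuming an example model (the conversion should be done before wrapping the model in DDP):

```python
import torch.nn as nn

# Placeholder model containing BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 128, 3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

# Recursively replaces every nn.BatchNorm*d layer with nn.SyncBatchNorm;
# the default process group is used unless one is passed explicitly.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```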
However, even after converting the batch normalization layers, you may encounter the same error if the model is in eval() mode. In this case, you need to set broadcast_buffers=False when wrapping the model in DistributedDataParallel to avoid the synchronization errors.
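For example, something along these lines (rank and device setup are assumed to be handled elsewhere):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(rank),
    device_ids=[rank],
    broadcast_buffers=False,  # skip broadcasting buffers (e.g. BN running stats) at each forward
)
```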
Is it possible to train a model with batch normalization using DDP while the model is in eval() mode and broadcast_buffers=True?
(I referred to this post for guidance.)