I’m trying to use
torch.nn.SyncBatchNorm.convert_sync_batchnorm in my DDP model. I can currently train with DDP without any problem while using mixed precision with torch.cuda.amp.autocast, but it stops working once I add torch.nn.SyncBatchNorm. I am running PyTorch 1.8.1 and Python 3.8 with CUDA 10.2. Here is how I am setting up the model:
from torch.cuda.amp import autocast, GradScaler

# convert all BatchNorm layers to SyncBatchNorm before wrapping in DDP
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)
net = net.to(device)
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank], find_unused_parameters=False)
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
scaler = GradScaler()

for epoch in range(starting_epoch, epochs):
    for idx, batch in enumerate(train_loader):
        # forward pass under autocast for mixed precision
        with autocast():
            pred = net(batch['data'])
            loss = loss_fn(pred, batch['target'])

        # zero gradients by setting them to None
        for param in net.parameters():
            param.grad = None

        scaler.scale(loss).backward()
This trains without any problem normally, but after adding torch.nn.SyncBatchNorm I get the following error:
File "/home/.conda/envs/main_env_2/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 545, in forward return sync_batch_norm.apply( File "/home/.conda/envs/main_env_2/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 38, in forward mean, invstd = torch.batch_norm_gather_stats_with_counts( RuntimeError: expected scalar type Half but found Float
I also tried rearranging what is wrapped in the autocast context (roughly as sketched below), but it did not work.
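The kind of rearrangement I mean is along these lines (a sketch, not my exact code, reusing the names from the snippet above): running the network's forward pass with autocast disabled so the SyncBatchNorm layers see float32 inputs, and only keeping the rest of the step in mixed precision.

# sketch of the attempted rearrangement (not the exact code)
with autocast(enabled=False):
    pred = net(batch['data'].float())
loss = loss_fn(pred, batch['target'])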