I am trying to train a neural network with DistributedDataParallel (DDP). Each frame (not image) in the dataset has a variable size and consists of several patches. Since the optimizer step should be taken once per frame rather than per patch, gradient accumulation over the patches needs to be incorporated. In my DDP setup, batch_size=1, i.e. one frame per GPU, and the Adam optimizer with learning rate 1e-3 is used. After 15-20 epochs, the training/validation loss stops improving and then fails to converge.
As an additional test, the single-GPU case works well with the same script: the training loss decreases monotonically. My current guess is that the problem is related to the unavoidable gradient synchronization between GPUs during the gradient-accumulation stage.
I've tested with different versions of PyTorch (an older 1.4.0 and a newer 1.8.0).
Could you help me please? Thanks!
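For context, the DDP and optimizer setup is roughly the following simplified sketch; the model class, dataset class, and criterion below are placeholders, not the actual code:
(train.py, setup part, simplified)
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # env:// init, variables provided by the launcher
torch.cuda.set_device(args.local_rank)

net = FrameNet().cuda(args.local_rank)  # FrameNet is a placeholder for the real model
ddp_net = DistributedDataParallel(net, device_ids=[args.local_rank])

optimizer = torch.optim.Adam(ddp_net.parameters(), lr=1e-3)
training_criterion = torch.nn.MSELoss()  # placeholder; the real criterion differs

train_dataset = FrameDataset(...)  # placeholder dataset: one variable-size frame per item
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=1,  # one frame per GPU
                          sampler=train_sampler, pin_memory=True)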
Here is my code:
(train.py)
with ddp_net.no_sync():
    # accumulate gradients locally for all patches except the last one (no inter-GPU sync)
    for patch_ind in range(patch_size-1):
        # extract patch
        patch_node_features = batch_node_features[patch_ind].cuda(non_blocking=True)
        patch_gt = batch_gt[patch_ind].cuda(non_blocking=True)
        patch_output = ddp_net(patch_node_features)
        # loss calculation
        train_loss = training_criterion(patch_output, patch_gt)
        train_loss_mean += train_loss.detach().cpu().numpy()
        train_loss = train_loss / train_node_num
        train_loss.backward()

# last patch is processed outside no_sync(), so this backward triggers gradient sync across GPUs
patch_node_features = batch_node_features[patch_size-1].cuda(non_blocking=True)
patch_gt = batch_gt[patch_size-1].cuda(non_blocking=True)
patch_output = ddp_net(patch_node_features)
train_loss = training_criterion(patch_output, patch_gt)
train_loss_mean += train_loss.detach().cpu().numpy()
train_loss = train_loss / train_node_num
train_loss.backward()  # with sync

# optimizer step on the accumulated (and now synchronized) gradients
iteration_count += 1
optimizer.step()
ddp_net.zero_grad()
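The setup sketch above assumes the script is launched with one process per GPU via torch.distributed.launch, e.g.:
python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py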