Code:
loss = criterion(output, targets)
# Gather the per-rank loss values (2 GPUs) into the list tmp
tmp = [torch.empty_like(loss).cuda(rank) for _ in range(2)]
dist.all_gather(tmp, loss)
print(tmp)
loss.backward()
optimizer.step()
print(f'Model on GPU {rank}, Epoch: {epoch}--{i}/{len(dataloader)}], Loss: {loss.data.item()}')
Output:
[tensor(0.6518, device='cuda:0'), tensor(0.7940, device='cuda:0')]
[tensor(0.6518, device='cuda:1'), tensor(0.7940, device='cuda:1')]
Model on GPU 0, Epoch: 1--0/28125], Loss: 0.6518099904060364
Model on GPU 1, Epoch: 1--0/28125], Loss: 0.7939583659172058
[tensor(0.7865, device='cuda:1'), tensor(0.7719, device='cuda:1')]
[tensor(0.7865, device='cuda:0'), tensor(0.7719, device='cuda:0')]
Model on GPU 0, Epoch: 1--1/28125], Loss: 0.7865331172943115
Model on GPU 1, Epoch: 1--1/28125], Loss: 0.7718786001205444
[tensor(0.7348, device='cuda:1'), tensor(0.8238, device='cuda:1')]
[tensor(0.7348, device='cuda:0'), tensor(0.8238, device='cuda:0')]
tmp now contains two tensors (the loss from each GPU).
On a single GPU I train with a batch size of 32. Since I have a very deep model and a huge dataset, I wanted to use multiple GPUs to speed up training. After reading this, I learned that I need to divide the batch size, so each of the two GPUs trains on a batch of 16. The gradient is computed on a batch of 16 on each GPU, and the averaged gradient is applied to the models, which has the same effect as processing one batch of 32 per iteration. I also want to gather the loss so that it is equivalent to the loss of that batch of 32.
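Here is a minimal sketch of what I have in mind, assuming each per-GPU loss is already the mean over its local batch of 16 and the world size is 2 (this is my assumption, not something I have verified):

# Sketch: combine the per-rank mean losses into the loss of the full batch of 32.
# Assumes each rank's `loss` is the mean over its local batch of 16.
batch_loss = loss.detach().clone()
dist.all_reduce(batch_loss, op=dist.ReduceOp.SUM)  # sum the per-rank losses
batch_loss /= dist.get_world_size()                # average over the 2 GPUs
print(f'GPU {rank}, loss for the combined batch of 32: {batch_loss.item()}')
# Equivalently, using the gathered list from above: torch.stack(tmp).mean()

Would averaging the per-GPU losses like this give the loss equivalent to a single batch of 32?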