If I train on a single GPU with a batch size of b, do I need to divide this batch size by the number of GPUs available for training in DDP?
Yep, you can start by dividing the batch size. But depending on the loss function and on whether each process consumes the same number of samples per iteration, DDP may or may not give you exactly the same result as local training. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
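In case it helps, here is a minimal sketch of how the per-GPU batch size is usually set up with DistributedSampler. It assumes the process group is already initialized and uses train_dataset as a placeholder name for your dataset:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumes the process group is already initialized; train_dataset is a placeholder.
world_size = dist.get_world_size()
per_gpu_batch_size = 32 // world_size  # e.g. 16 per GPU when world_size == 2

# DistributedSampler gives each rank a disjoint shard of the data, so every
# iteration processes per_gpu_batch_size * world_size samples in total.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=per_gpu_batch_size, sampler=sampler)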
How can I calculate F1 score, precision, and recall for a model being trained in DDP?
DDP does not change the behavior of the forward pass, so these metrics can be calculated the same way as in local training. But since the outputs and loss now live on multiple GPUs, you might need to gather/allgather them first if you need global numbers.
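For example, a minimal sketch of computing global binary precision/recall/F1 by gathering predictions and labels could look like the following. It assumes an initialized process group and equal per-rank batch sizes; preds and labels are placeholders for this rank's 1-D tensors:

import torch
import torch.distributed as dist

# preds and labels are placeholders for this rank's 1-D prediction/label tensors.
world_size = dist.get_world_size()
gathered_preds = [torch.empty_like(preds) for _ in range(world_size)]
gathered_labels = [torch.empty_like(labels) for _ in range(world_size)]
dist.all_gather(gathered_preds, preds)
dist.all_gather(gathered_labels, labels)

all_preds = torch.cat(gathered_preds)
all_labels = torch.cat(gathered_labels)

# Binary-classification counts over the combined batch.
tp = ((all_preds == 1) & (all_labels == 1)).sum().float()
fp = ((all_preds == 1) & (all_labels == 0)).sum().float()
fn = ((all_preds == 0) & (all_labels == 1)).sum().float()
precision = tp / (tp + fp + 1e-8)
recall = tp / (tp + fn + 1e-8)
f1 = 2 * precision * recall / (precision + recall + 1e-8)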
If I store the local loss of two GPUs in two arrays, is it okay to add them and divide by the number of GPUs to get an average?
Similar to the 1st bullet, this depends on your loss function. If it's something like MSE, then yes, the average of the two local losses should be the same as the global one. But other loss functions might not have this property.
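A quick toy check (no DDP involved) of why this works for a mean-reduced loss over equal-sized halves:

import torch
import torch.nn.functional as F

# With reduction='mean' and equal-sized halves, the average of the two
# "local" losses equals the loss over the full batch.
pred, target = torch.randn(32, 10), torch.randn(32, 10)
full = F.mse_loss(pred, target)
half_a = F.mse_loss(pred[:16], target[:16])
half_b = F.mse_loss(pred[16:], target[16:])
print(torch.allclose(full, (half_a + half_b) / 2))  # True

With unequal splits (e.g. a smaller last batch), the plain average no longer matches, so you would need to weight by the per-rank batch sizes.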
Hey @Saurav_Gupta1, the usage of torch.distributed.all_gather looks correct to me. One thing I want to mention is that torch.distributed.all_gather is not an autograd function, so running backward on gathered tensors (i.e., tmp in your code) won't reach the autograd graph prior to the all_gather operation. If you really need autograd to extend beyond comm ops, you can try this experimental autograd-powered all_gather.
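In case it helps, here is a rough sketch of what the autograd-enabled variant could look like, assuming the experimental torch.distributed.nn.functional module linked above is available in your build (loss here is the per-rank loss from your snippet):

import torch
import torch.distributed.nn.functional as dist_nn_functional

# Gradients flow back through this all_gather, so a loss built from the
# gathered tensors can reach each rank's local autograd graph.
gathered = dist_nn_functional.all_gather(loss)  # one tensor per rank
global_loss = torch.stack(gathered).mean()
global_loss.backward()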
My intention here is only to gather information regarding loss and accuracy, and to calculate F1 score and precision. I don't want to run backward on gathered data. Am I gathering the loss correctly? I am asking because in the above example tmp contains the loss from GPU:0, but in some other run tmp contains the loss of GPU:1.
loss = criterion(output, targets)
# Gather the scalar loss from both ranks into tmp (world size of 2).
tmp = [torch.empty_like(loss).cuda(rank) for _ in range(2)]
dist.all_gather(tmp, loss)
print(tmp)
loss.backward()
optimizer.step()
print(f'Model on GPU {rank}, Epoch: {epoch}--{i}/{len(dataloader)}, Loss: {loss.item()}')
Output:
[tensor(0.6518, device='cuda:0'), tensor(0.7940, device='cuda:0')]
[tensor(0.6518, device='cuda:1'), tensor(0.7940, device='cuda:1')]
Model on GPU 0, Epoch: 1--0/28125, Loss: 0.6518099904060364
Model on GPU 1, Epoch: 1--0/28125, Loss: 0.7939583659172058
[tensor(0.7865, device='cuda:1'), tensor(0.7719, device='cuda:1')]
[tensor(0.7865, device='cuda:0'), tensor(0.7719, device='cuda:0')]
Model on GPU 0, Epoch: 1--1/28125, Loss: 0.7865331172943115
Model on GPU 1, Epoch: 1--1/28125, Loss: 0.7718786001205444
[tensor(0.7348, device='cuda:1'), tensor(0.8238, device='cuda:1')]
[tensor(0.7348, device='cuda:0'), tensor(0.8238, device='cuda:0')]
tmp right now contains two tensors (the loss from each GPU).
On a single GPU I train with a batch size of 32. Since I have a very deep model and a huge dataset, I thought of using multiple GPUs to increase training speed. After reading this, I learned that I need to divide my batch size and train the model with a batch size of 16 on each of the two GPUs. The gradient is computed on a batch of 16 on each GPU, and the average of the gradients is applied to the models, which gives the same effect as processing a batch of 32 and applying the gradient in one iteration. I also want to gather the loss that corresponds to this batch of 32.
Okay, let me try to rephrase my question: the gather has given me the loss calculated by each GPU. My batch size for the model being trained on 2 GPUs is 16, which means 32 images are processed at a time. The gradient step that was taken was with respect to those 32 images. torch.distributed.all_gather returns the loss with respect to the 16 images on each GPU in a list. I want to get the loss with respect to the 32 images that were processed in total in one iteration.
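For logging only, something like the following is what I have in mind: a minimal sketch, assuming a mean-reduced loss and equal per-rank batch sizes, that averages the local losses across the two ranks (loss and rank are from my snippet above):

import torch.distributed as dist

# For logging only: average the mean-reduced local losses across the two ranks
# to get the loss over the combined batch of 32 (assumes equal per-rank batch sizes).
global_loss = loss.detach().clone()
dist.all_reduce(global_loss, op=dist.ReduceOp.SUM)
global_loss /= dist.get_world_size()
print(f'Rank {rank}, loss over the full batch of 32: {global_loss.item()}')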
I see. In that case, DDP alone won't be sufficient, as DDP's output and loss are local to each process. If you only need to calculate the global loss, one option is to gather the outputs instead of the loss and then calculate the loss on the gathered outputs (see the sketch after the two options below). If you also need back-propagation from the global loss, there are at least two options:
Combine DDP with RPC: use DDP to compute the local outputs, use RPC to collect them into one process and compute the global loss, and then use distributed autograd to start the backward pass from the global loss. This tutorial shows how to combine DDP with RPC.
Use the autograd-enabled collective communications in torch.distributed.nn.functional. Tests in this PR can serve as examples. (This feature is not officially released yet.)
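For the logging-only path mentioned above (gather the outputs, then compute the loss on them), here is a minimal sketch. It reuses output, targets, and criterion from your earlier snippet and assumes equal per-rank batch sizes and an initialized process group:

import torch
import torch.distributed as dist

# Gather per-rank outputs and targets, then compute the loss over the
# combined batch on every rank (no backward through this global loss).
world_size = dist.get_world_size()
gathered_out = [torch.empty_like(output) for _ in range(world_size)]
gathered_tgt = [torch.empty_like(targets) for _ in range(world_size)]
dist.all_gather(gathered_out, output.detach())
dist.all_gather(gathered_tgt, targets)
global_loss = criterion(torch.cat(gathered_out), torch.cat(gathered_tgt))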
I have another question: what would be the best way to do all_gather when the batch size isn't fixed? E.g., when drop_last=False in the DataLoader, the last batch size will be different.
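One common workaround (a sketch, not an official recipe): torch.distributed.all_gather expects same-shaped tensors on every rank, so you can share each rank's batch size first, pad every tensor to the largest size, all_gather, and then trim the padding. Here, local is a placeholder for this rank's 1-D tensor (e.g. per-sample losses or predictions):

import torch
import torch.distributed as dist

# Share each rank's size, pad to the largest size, gather, then trim.
world_size = dist.get_world_size()

local_size = torch.tensor([local.shape[0]], device=local.device)
sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
dist.all_gather(sizes, local_size)
max_size = int(max(s.item() for s in sizes))

padded = torch.zeros(max_size, dtype=local.dtype, device=local.device)
padded[:local.shape[0]] = local
gathered = [torch.zeros_like(padded) for _ in range(world_size)]
dist.all_gather(gathered, padded)

# Drop the padding using the true sizes.
result = torch.cat([g[:int(s.item())] for g, s in zip(gathered, sizes)])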