How to calculate F1 score, Precision in DDP

Hi, I am new to the concept of DDP. I am currently training my model on two GPUs.

  1. If I train on a single GPU with a batch size of b, do I need to divide this batch size by the number of GPUs available for training in DDP?

  2. How can I calculate F1 score, Precision, Recall for a model being trained in DDP?

  3. If I store the local loss of the two GPUs in two arrays, is it okay to add them and divide by the number of GPUs to get an average?

  1. If I train on a single GPU with a batch size of b, do I need to divide this batch size by the number of GPUs available for training in DDP?

Yep, you can start by dividing the batch size. But depending on the loss function and whether each process consumes the same number of samples per iteration, DDP may or may not give you exactly the same result as local training. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
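For concreteness, here is a minimal sketch of that setup with a DistributedSampler; dataset, rank, world_size, and num_epochs are placeholders for whatever your training script already defines:

from torch.utils.data import DataLoader, DistributedSampler

# Hypothetical sketch: keep the *global* batch size the same when moving to DDP.
# `dataset`, `rank`, `world_size`, and `num_epochs` are placeholders.
global_batch_size = 32
per_gpu_batch_size = global_batch_size // world_size   # e.g. 16 per process on 2 GPUs

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
dataloader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # so each epoch gets a different shuffle per rank
    for batch in dataloader:
        pass   # forward / backward / optimizer.step() as usual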

  2. How can I calculate F1 score, Precision, Recall for a model being trained in DDP?

DDP does not change the behavior of the forward pass, so these metrics can be calculated just as in local training. But since the outputs and loss now live on multiple GPUs, you might need to gather/all_gather them first if you need global numbers.
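For example, one hedged way to get global numbers without gathering full outputs is to all_reduce the local confusion-matrix counts; the sketch below assumes binary 0/1 labels and that preds and targets are tensors already on this rank's GPU:

import torch
import torch.distributed as dist

# Hedged sketch: compute global precision/recall/F1 by all_reducing the local
# confusion-matrix counts. Assumes binary 0/1 labels and that `preds` and
# `targets` are 1-D tensors already on this rank's GPU.
def global_precision_recall_f1(preds, targets, eps=1e-8):
    tp = ((preds == 1) & (targets == 1)).sum().float()
    fp = ((preds == 1) & (targets == 0)).sum().float()
    fn = ((preds == 0) & (targets == 1)).sum().float()

    counts = torch.stack([tp, fp, fn])
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)   # sum the counts over all ranks
    tp, fp, fn = counts

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision.item(), recall.item(), f1.item()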

  3. If I store the local loss of the two GPUs in two arrays, is it okay to add them and divide by the number of GPUs to get an average?

Similar to the 1st bullet, this depends on your loss function. If it's something like MSE, then yes, the average of the two local losses should be the same as the global one. But other loss functions might not have this property.
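If you just want to log an averaged loss, a minimal sketch (assuming loss is a scalar tensor from a criterion with reduction='mean' and every rank saw the same number of samples in this iteration) is:

import torch.distributed as dist

# Hedged sketch: average the per-rank losses for logging only. This equals the
# global mean loss only if the criterion reduces with 'mean' and every rank
# processed the same number of samples in this iteration.
loss_for_logging = loss.detach().clone()
dist.all_reduce(loss_for_logging, op=dist.ReduceOp.SUM)
loss_for_logging /= dist.get_world_size()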

Since I am using multiple GPUs with the NCCL backend, gather does not work, so I tried using all_gather. Below is a part of my code.

Code:

loss = criterion(output, targets)
# all_gather collects the scalar loss from every rank into the tmp list
tmp = [torch.empty_like(loss).cuda(rank) for _ in range(2)]
dist.all_gather(tmp, loss)
print(tmp[0].data.item())
loss.backward()
optimizer.step()
print(f'Model on GPU {rank}, Epoch: {epoch}--{i}/{len(dataloader)}],  Loss: {loss.data.item()}')

Output:

0.9214748740196228
0.9214748740196228
Model on GPU 0, Epoch: 1--0/28125],  Loss: 0.9214748740196228,  Batch_time_Average - 4.840
Model on GPU 1, Epoch: 1--0/28125],  Loss: 0.9064291715621948,  Batch_time_Average - 4.781
0.7848501801490784
0.7848501801490784
Model on GPU 0, Epoch: 1--1/28125],  Loss: 0.7848501801490784,  Batch_time_Average - 6.893
Model on GPU 1, Epoch: 1--1/28125],  Loss: 0.6567432880401611,  Batch_time_Average - 6.798
0.7931838035583496
0.7931838035583496
Model on GPU 0, Epoch: 1--2/28125],  Loss: 0.7931838035583496,  Batch_time_Average - 8.924
Model on GPU 1, Epoch: 1--2/28125],  Loss: 0.825346052646637,  Batch_time_Average - 8.835
1.0175780057907104
1.0175780057907104
Model on GPU 0, Epoch: 1--3/28125],  Loss: 1.0175780057907104,  Batch_time_Average - 10.966
Model on GPU 1, Epoch: 1--3/28125],  Loss: 0.5258045196533203,  Batch_time_Average - 10.868

tmp is just printing the loss value from GPU 0. I am using categorical cross entropy. I am not sure if this is the required output.

Hey @Saurav_Gupta1, the usage of torch.distributed.all_gather looks correct to me. One thing I want to mention is that torch.distributed.all_gather is not an autograd function, so running backward on gathered tensors (i.e., tmp in your code) won't reach the autograd graph prior to the all_gather operation. If you really need autograd to extend beyond communication ops, you can try this experimental autograd-powered all_gather.
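For reference, a rough sketch of what the autograd-aware collective looks like; the import path used here (torch.distributed.nn.functional) is an assumption and may differ across PyTorch versions, since the feature was experimental at the time:

import torch
import torch.distributed.nn.functional as dist_nn_F

# Rough sketch only: this functional all_gather keeps the autograd graph, so a
# backward pass on the gathered result reaches each rank's local graph.
gathered = dist_nn_F.all_gather(loss)             # per-rank loss tensors
global_loss = torch.stack(list(gathered)).mean()
global_loss.backward()                            # gradients flow back through the gather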

My intention here is only to gather information about the loss and accuracy and to calculate the F1 score and precision. I don't want to run backward on the gathered data. Am I gathering the loss correctly? I am asking because in the above example tmp contains the loss from GPU 0, but in some other runs tmp contains the loss from GPU 1.

It looks correct to me.

I am asking because in the above example tmp contains the loss from GPU 0, but in some other runs tmp contains the loss from GPU 1.

Could you please share a repro where tmp[0] contains the loss from rank 1? The above output seems to show only the loss from rank 0.

Code:

loss = criterion(output, targets)
tmp = [torch.empty_like(loss).cuda(rank) for _ in range(2)]
dist.all_gather(tmp, loss)
print(tmp)
loss.backward()
optimizer.step()
print(f'Model on GPU {rank}, Epoch: {epoch}--{i}/{len(dataloader)}],  Loss: {loss.data.item()}')

Output:

[tensor(0.6518, device='cuda:0'), tensor(0.7940, device='cuda:0')]
[tensor(0.6518, device='cuda:1'), tensor(0.7940, device='cuda:1')]
Model on GPU 0, Epoch: 1--0/28125],  Loss: 0.6518099904060364
Model on GPU 1, Epoch: 1--0/28125],  Loss: 0.7939583659172058
[tensor(0.7865, device='cuda:1'), tensor(0.7719, device='cuda:1')]
[tensor(0.7865, device='cuda:0'), tensor(0.7719, device='cuda:0')]
Model on GPU 0, Epoch: 1--1/28125],  Loss: 0.7865331172943115
Model on GPU 1, Epoch: 1--1/28125],  Loss: 0.7718786001205444
[tensor(0.7348, device='cuda:1'), tensor(0.8238, device='cuda:1')]
[tensor(0.7348, device='cuda:0'), tensor(0.8238, device='cuda:0')]

tmp now contains two tensors (the loss from each GPU).

On a single GPU I train with a batch size of 32. Since I have a very deep model and a huge dataset, I thought of using multiple GPUs to increase training speed. After reading this I came to know that I need to divide my batch size and train the model with a batch size of 16 on each of the two GPUs. The gradient is computed on a batch of 16 on each GPU and the averaged gradient is applied to the model, which has the same effect as processing a batch of 32 in one iteration. I also want to gather the loss that is equivalent to this batch of 32.

The output looks as expected to me: tmp[0] contains the rank 0 loss and tmp[1] contains the rank 1 loss. Did I miss anything?

Okay, let me try to rephrase my question. all_gather has given me the loss calculated by each GPU. My batch size for the model being trained on 2 GPUs is 16, which means 32 images are processed at a time. The gradient step that was taken was with respect to 32 images. torch.distributed.all_gather returns the loss with respect to the 16 images on each GPU in a list. I want to get the loss with respect to the 32 images that were processed in one iteration in total.

I see. In that case, DDP alone won't be sufficient, as DDP's output and loss are local to each process. If you only need to calculate the global loss, one option is to gather the outputs instead of the loss and then calculate the loss on the gathered outputs (see the sketch after the list below). If you also need backpropagation from the global loss, there are at least two options:

  1. Combine DDP with RPC: use DDP to compute the local outputs, use RPC to collect them into one process and compute the global loss, and then use distributed autograd to start the backward pass from the global loss. This tutorial shows how to combine DDP with RPC.
  2. Use the autograd-enabled collective communications in torch.distributed.nn.functional. Tests in this PR can serve as examples. (This feature is not officially released yet.)
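As a sketch of the first suggestion (logging only, since the plain all_gather does not carry gradients), reusing the output, targets, and criterion names from the code earlier in the thread and assuming equal batch sizes on every rank:

import torch
import torch.distributed as dist

# Hedged sketch of "gather the outputs instead of the loss" (no backward through
# the gathered tensors). Reuses `output`, `targets`, `criterion` from the code above.
world_size = dist.get_world_size()

gathered_outputs = [torch.empty_like(output) for _ in range(world_size)]
gathered_targets = [torch.empty_like(targets) for _ in range(world_size)]
dist.all_gather(gathered_outputs, output.detach())
dist.all_gather(gathered_targets, targets)

global_loss = criterion(torch.cat(gathered_outputs), torch.cat(gathered_targets))
print(f'Loss over the combined batch of 32: {global_loss.item()}')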

I have another question: what would be the best way to do all_gather when the batch size isn't fixed, e.g. when drop_last=False in the DataLoader, so the last batch size will be different?
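One common workaround, sketched here as an assumption rather than an official recipe, is to all_gather the per-rank batch sizes first, pad every local tensor to the maximum size, gather, and then trim the padding away:

import torch
import torch.distributed as dist

# Hedged sketch: gather each rank's batch size first, pad local tensors to the
# maximum size so all_gather sees equal shapes, then trim the padding away.
# Assumes the tensors only differ in their first (batch) dimension.
def all_gather_variable_batch(tensor):
    world_size = dist.get_world_size()

    # 1. share the local batch size with every rank
    local_size = torch.tensor([tensor.shape[0]], device=tensor.device)
    sizes = [torch.zeros_like(local_size) for _ in range(world_size)]
    dist.all_gather(sizes, local_size)
    max_size = int(torch.stack(sizes).max())

    # 2. pad to the maximum size
    padded = torch.zeros(max_size, *tensor.shape[1:], dtype=tensor.dtype, device=tensor.device)
    padded[:tensor.shape[0]] = tensor
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # 3. drop the padding using the true sizes
    return torch.cat([g[:int(s)] for g, s in zip(gathered, sizes)])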