- If I train on a single GPU with a batch size of `b`, do I need to divide this batch size by the number of GPUs available for training in DDP?
Yep, you can start by dividing the batch size. But depending on the loss function and on whether each process consumes the same number of samples per iteration, DDP may or may not give you exactly the same result as local training. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
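As a rough sketch (assuming you want to keep the global batch size `b` constant; `MyDataset` is just a placeholder), the per-process batch size could be set like this:

```python
# Sketch: keep the global batch size b constant by giving each DDP process
# b // world_size samples per iteration. MyDataset is a placeholder.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")
world_size = dist.get_world_size()

b = 256                          # batch size used for single-GPU training
per_gpu_batch = b // world_size  # each rank consumes this many samples per step

dataset = MyDataset()                   # placeholder dataset
sampler = DistributedSampler(dataset)   # gives each rank a distinct shard
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
```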
- How can I calculate F1 score, Precision, Recall for a model being trained in DDP?
DDP does not change the behavior of the forward pass, so these metrics can be calculated the same way as in local training. But since the outputs and loss now live on multiple GPUs, you might need to gather/all_gather them first if you need global numbers.
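For example, a hedged sketch of computing global metrics by all-gathering per-rank predictions and targets (here `local_preds` and `local_targets` are assumed to be same-length 1-D tensors on every rank, and scikit-learn is used only for illustration):

```python
# Sketch: gather predictions/targets from all ranks, then compute metrics globally.
import torch
import torch.distributed as dist
from sklearn.metrics import f1_score, precision_score, recall_score

def global_metrics(local_preds, local_targets):
    world_size = dist.get_world_size()
    preds_list = [torch.zeros_like(local_preds) for _ in range(world_size)]
    targets_list = [torch.zeros_like(local_targets) for _ in range(world_size)]
    dist.all_gather(preds_list, local_preds)      # every rank receives all predictions
    dist.all_gather(targets_list, local_targets)  # and all targets
    preds = torch.cat(preds_list).cpu().numpy()
    targets = torch.cat(targets_list).cpu().numpy()
    return (f1_score(targets, preds, average="macro"),
            precision_score(targets, preds, average="macro"),
            recall_score(targets, preds, average="macro"))
```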
- If I store the local loss of two GPUs in two arrays, is it okay to add them and divide by the number of GPUs to get an average?
Similar to the 1st bullet, this depends on your loss function. If it's something like MSE, then yes, the average of the two local losses should be the same as the global one. But other loss functions might not have this property.
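For such mean-style losses, a minimal sketch of that averaging with `all_reduce` (the helper name `average_loss` is just illustrative):

```python
# Sketch: sum the local losses across all ranks and divide by the number
# of GPUs; valid when the local loss is already a mean over local samples
# and each rank sees the same number of samples.
import torch
import torch.distributed as dist

def average_loss(local_loss: torch.Tensor) -> torch.Tensor:
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)  # sum losses from all ranks
    loss /= dist.get_world_size()                # divide by the number of GPUs
    return loss
```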