I understand that loss.backward() when training with DDP overlaps compute and communication by syncing gradients in buckets, i.e. the all-reduce for a bucket is launched once all of the gradients in that bucket have been computed. Is there a way to break down the time taken by these two phases: compute (gradient calculation) and communication (syncing gradients via all-reduce)?
Any pointers regarding this would be extremely helpful!
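
For reference, here is a minimal sketch of the kind of training step I mean (assumed to be launched with torchrun; the model, data, and bucket_cap_mb value are just placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
# bucket_cap_mb controls the size of the gradient buckets whose all-reduce
# is launched as soon as all gradients in the bucket are ready.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 1024, device=local_rank)
targets = torch.randn(32, 1024, device=local_rank)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()  # gradient kernels and bucketed all-reduces overlap here
optimizer.step()
```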
You could profile the workload with e.g. Nsight Systems and check the timeline to see the compute and communication kernels (the NCCL all-reduce kernels run on a separate stream), which should allow you to estimate the time spent in each phase and how much they overlap.
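
For example, something along these lines should work as a starting point (a sketch only; the function and script names are placeholders, and the nsys flags are one possible setup):

```python
import torch

def timed_step(ddp_model, inputs, targets):
    # Run one training step under emit_nvtx so every autograd op is
    # annotated with an NVTX range in the Nsight Systems timeline.
    with torch.autograd.profiler.emit_nvtx():
        torch.cuda.nvtx.range_push("forward")
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("backward")
        loss.backward()  # gradient kernels + bucketed all-reduces overlap here
        torch.cuda.nvtx.range_pop()
    return loss
```

and then launch the script under nsys, e.g.:

```bash
nsys profile -t cuda,nvtx -o ddp_timeline torchrun --nproc_per_node=2 train_ddp.py
```

In the resulting timeline, the kernels launched inside the `backward` range should split into the gradient-computation kernels on the compute stream and the NCCL all-reduce kernels on their own stream, so you can read off the time spent in each and how much of it is overlapped.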