I understand that loss.backward() when training with DDP overlaps compute and communication by syncing gradients in buckets, i.e. the all-reduce for a bucket is launched once all of the gradients in that bucket have been computed. Is there a way to break down the time taken by these two phases: compute (gradient calculation) and communication (syncing gradients via all-reduce)?
Any pointers regarding this would be extremely helpful!
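
For reference, here is a minimal sketch of the kind of training step I mean (assumed to be launched with torchrun; the model, data, and bucket_cap_mb value are just placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)
# bucket_cap_mb controls the size of the gradient buckets whose all-reduce
# is launched as soon as all gradients in the bucket are ready.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 1024, device=local_rank)
targets = torch.randn(32, 1024, device=local_rank)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()  # gradient kernels and bucketed all-reduces overlap here
optimizer.step()
```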
You could profile the workload with e.g. Nsight Systems and check the timeline to see the compute and communication kernels (the NCCL all-reduce kernels run on a separate stream), which should allow you to estimate the time spent in each phase and how much they overlap.
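
For example, something along these lines should work as a starting point (a sketch only; the function and script names are placeholders, and the nsys flags are one possible setup):

```python
import torch

def timed_step(ddp_model, inputs, targets):
    # Run one training step under emit_nvtx so every autograd op is
    # annotated with an NVTX range in the Nsight Systems timeline.
    with torch.autograd.profiler.emit_nvtx():
        torch.cuda.nvtx.range_push("forward")
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("backward")
        loss.backward()  # gradient kernels + bucketed all-reduces overlap here
        torch.cuda.nvtx.range_pop()
    return loss
```

and then launch the script under nsys, e.g.:

```bash
nsys profile -t cuda,nvtx -o ddp_timeline torchrun --nproc_per_node=2 train_ddp.py
```

In the resulting timeline, the kernels launched inside the `backward` range should split into the gradient-computation kernels on the compute stream and the NCCL all-reduce kernels on their own stream, so you can read off the time spent in each and how much of it is overlapped.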