What is the proper way to troubleshoot DDP speed gains?

What is the proper way to troubleshoot the speed gain from DDP (DistributedDataParallel) in order to find bottlenecks?

I have 2 different models: model 1 (a ResNet, ~32 million parameters) and model 2 (a U-Net, ~23 million parameters).

In terms of data, model 2 does an additional disk access per sample to load extra sample-related data, while model 1 only loads the basic data.

We consider using either a single GPU or 2 GPUs.
When using 2 GPUs, model 1 gains 50% in speed compared to a single GPU.
Model 2, on the other hand, gains only 15%.

I am using a DataLoader with a DistributedSampler and 4 workers per process.
Every process uses only one GPU.

Both GPUs are in the same machine.
I use the NCCL backend.
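For reference, the setup looks roughly like this. A minimal sketch, assuming one process per GPU on a single machine; `setup_ddp` and `build_loader` are hypothetical helper names, and the port and batch size are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def setup_ddp(rank: int, world_size: int, backend: str = "nccl") -> None:
    # One process per GPU; NCCL is the recommended backend for
    # same-machine multi-GPU training. Address/port are placeholders.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if backend == "nccl":
        # Bind this process to its own GPU.
        torch.cuda.set_device(rank)


def build_loader(dataset, rank: int, world_size: int,
                 batch_size: int = 32, num_workers: int = 4) -> DataLoader:
    # DistributedSampler gives each process a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=batch_size,
                      sampler=sampler, num_workers=num_workers)
```

The model itself would then be wrapped with `torch.nn.parallel.DistributedDataParallel` inside each process.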


I did some timing.
Using multiple GPUs makes one type of loss act up.
The loss runs on the CPU, so I am not sure why that is.
When using a single GPU, the loss takes 70 ms to forward.
When using multiple GPUs, it takes 600 ms.
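To get numbers like these reliably, I time the loss forward in isolation. A sketch of the timing helper I use (`time_loss` is a hypothetical name); note that `torch.cuda.synchronize` matters when the loss touches the GPU, because CUDA kernels launch asynchronously and un-synchronized timing only measures launch overhead:

```python
import time

import torch


def time_loss(loss_fn, *args, device=None, iters: int = 10, warmup: int = 3) -> float:
    """Return the mean forward time of loss_fn(*args) in milliseconds."""
    for _ in range(warmup):
        loss_fn(*args)  # warm up caches / allocator
    if device is not None and device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for pending kernels
    t0 = time.perf_counter()
    for _ in range(iters):
        loss_fn(*args)
    if device is not None and device.type == "cuda":
        torch.cuda.synchronize(device)
    return (time.perf_counter() - t0) / iters * 1000.0
```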

I am looking into why this is happening.
The observed 15% time gain came from the validation, which is done in multi-GPU.


It turns out one of the losses heavily uses multi-threading.
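A plausible explanation (my assumption, not something I have verified yet): with one process per GPU, each process's multi-threaded CPU loss defaults to using all cores, so two DDP processes oversubscribe the machine and the loss slows down badly. One thing to try is capping the thread count per process; `limit_cpu_threads` is a hypothetical helper:

```python
import os

import torch


def limit_cpu_threads(world_size: int) -> int:
    """Split the machine's cores evenly across the DDP processes."""
    cores = os.cpu_count() or 1
    per_proc = max(1, cores // world_size)
    # Caps PyTorch's own intra-op thread pool at runtime.
    torch.set_num_threads(per_proc)
    # For OpenMP-based custom C++ code this is usually read at library load
    # time, so it may need to be exported before the processes are launched.
    os.environ["OMP_NUM_THREADS"] = str(per_proc)
    return per_proc
```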

I’m wondering: why is the loss not computed on the GPUs?

It is a custom C++ CPU loss provided by another work.
I am currently looking for a GPU version.
From an initial run, the GPU loss does not seem to be faster than the multi-threaded C++ CPU implementation.

I am still trying to find out why, and how to further speed up the GPU version.
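To see where the GPU version actually spends its time, `torch.profiler` can break the loss forward down per operator. A sketch (`profile_loss` is a hypothetical name):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_loss(loss_fn, *args, use_cuda: bool = False) -> str:
    """Profile one forward pass of loss_fn and return a per-op summary table."""
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        loss_fn(*args)
    # Sorting by self CPU time surfaces host-side bottlenecks such as
    # small kernel launches or CPU<->GPU transfers.
    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
```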