What is the proper way to troubleshoot the speedup from DDP (DistributedDataParallel) in order to find bottlenecks?
I have 2 different models: model 1 (ResNet) and model 2 (U-Net).
Size of model 2 (23 million) > size of model 1 (32 million).
In terms of data, model 2 does an additional disk access to load more sample-related data, while model 1 only loads the basic data.
We consider using either a single GPU or 2 GPUs.
When using 2 GPUs, model 1 gains 50% in speed compared to a single GPU.
On the other hand, model 2 only gains 15%.
I am using a DataLoader with a DistributedSampler and 4 workers per process.
Every process uses only one GPU.
Both GPUs are in the same machine.
I use the NCCL backend.
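
A possible first check for the setup described above, given only as a minimal sketch: split each iteration's wall-clock time into "waiting for the next batch" and "running the training step". The names `profile_loop`, `step`, and `sync` are hypothetical helpers made up for this illustration, not part of PyTorch. It assumes the loader is any iterable (such as the DataLoader) and that `step` runs forward, backward, and the optimizer update for one batch. On a GPU, `torch.cuda.synchronize` can be passed as `sync` so that asynchronous kernels are counted in the step time.

```python
import time
from typing import Callable, Iterable, Optional


def profile_loop(
    loader: Iterable,
    step: Callable,
    sync: Optional[Callable[[], None]] = None,
    max_iters: int = 50,
) -> dict:
    """Split wall-clock time into time spent waiting for data and time spent in the step.

    `step` and `sync` are hypothetical hooks for this sketch:
    - step(batch) should run one full training iteration.
    - sync() should block until device work is finished
      (for CUDA, torch.cuda.synchronize would be a natural choice).
    """
    data_time = 0.0
    step_time = 0.0
    iters = 0
    it = iter(loader)
    while iters < max_iters:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the wait on the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)
        if sync is not None:
            sync()  # make sure asynchronous device work is included
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        iters += 1
    total = data_time + step_time
    return {
        "iters": iters,
        "data_s": data_time,
        "step_s": step_time,
        "data_frac": data_time / total if total > 0 else 0.0,
    }
```

Running this for both models, on 1 GPU and on 2 GPUs, gives four sets of numbers to compare. If `data_frac` is high for model 2 but low for model 1, the additional disk access is a likely suspect. If it is low in both cases, the step itself (including gradient synchronization) is the more likely place to look.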