Why is my DistributedDataParallel slower than DataParallel when my Dataset is not fully loaded in memory?

Is it possible that, if your data is small enough to fit entirely in memory, the DDP setup overhead simply adds time to the task without any performance improvement? In other words: GPU utilization is low enough that you just can't see the gains of using multiple GPUs.
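A toy back-of-envelope model can illustrate this. The sketch below is not a real benchmark: `epoch_time` is a hypothetical helper, and the compute and overhead numbers are made-up assumptions. It only shows the shape of the trade-off: DDP adds a roughly fixed setup/communication cost per run, while the compute work is split across GPUs, so for a small workload the overhead can dominate.

```python
def epoch_time(work_s, num_gpus, overhead_s):
    """Idealized epoch time: compute split evenly across GPUs,
    plus a fixed setup/communication overhead (a simplification)."""
    return overhead_s + work_s / num_gpus

# Small workload: 2 s of compute, with an assumed 3 s of DDP overhead.
print(epoch_time(2.0, 1, 0.0))    # single GPU, no DDP
print(epoch_time(2.0, 4, 3.0))    # 4 GPUs with DDP -- slower than 1 GPU here

# Large workload: 200 s of compute, same assumed overhead.
print(epoch_time(200.0, 1, 0.0))  # single GPU
print(epoch_time(200.0, 4, 3.0))  # 4 GPUs with DDP -- now clearly faster
```

With realistic workloads the overhead is amortized over much more compute, which is why DDP typically wins only when each GPU has enough work to do.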
