Is it possible that, if your data is small enough to fit entirely in memory, the DDP setup overhead just adds time to the task without any performance improvement? In other words: GPU utilization is already so low that you can't see any gain from using multiple GPUs.
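
To make the intuition concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not a measurement: it assumes per-step compute splits evenly across GPUs and that DDP adds a fixed per-step synchronization cost. The function names and the timing numbers are hypothetical and only there for illustration.

```python
def estimated_step_time(compute_s: float, n_gpus: int, sync_overhead_s: float) -> float:
    """Toy model of one training step.

    Assumes compute divides evenly across GPUs and that a fixed
    synchronization cost is paid only when more than one GPU is used.
    """
    if n_gpus == 1:
        return compute_s
    return compute_s / n_gpus + sync_overhead_s


def speedup(compute_s: float, n_gpus: int, sync_overhead_s: float) -> float:
    """Single-GPU step time divided by the estimated multi-GPU step time."""
    return compute_s / estimated_step_time(compute_s, n_gpus, sync_overhead_s)


# Hypothetical numbers: a tiny workload (4 ms of compute per step)
# versus a heavy one (400 ms), both with 10 ms of sync overhead on 4 GPUs.
small = speedup(compute_s=0.004, n_gpus=4, sync_overhead_s=0.010)
large = speedup(compute_s=0.400, n_gpus=4, sync_overhead_s=0.010)

print(f"small workload speedup: {small:.2f}x")
print(f"large workload speedup: {large:.2f}x")
```

Under these assumptions the small workload gets slower with 4 GPUs (speedup below 1), while the large one gets close to the ideal 4x. To check whether this applies to a real job, you would need to measure actual per-step compute and communication time.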