Multiple NCCL calls reduce training speed?


I ran 2 training jobs on a GPU cluster with 8 GPUs, and both jobs use NCCL AllReduce. I noticed that training is slower when the 2 jobs run at the same time than when they run separately. Is this because they are competing for GPU-to-GPU communication bandwidth during the AllReduce calls? Thanks.

Hi @Yi_Zhang, yes, I think that is likely the case. Running multiple training jobs at once increases contention for the communication bandwidth between the GPUs, so each job runs slower than it would on its own.
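To get a rough feel for the size of the effect, here is a back-of-the-envelope sketch. It uses the standard bandwidth-only estimate for a ring all-reduce, in which each GPU transfers about 2(N-1)/N times the message size. The gradient size, the link bandwidth, and the assumption that two concurrent jobs split the link evenly are all made-up illustrations, not measurements from this cluster, and NCCL may pick a different algorithm than a ring.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_sec):
    """Bandwidth-only estimate for a ring all-reduce (latency is ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to communicate with a single GPU
    # Each GPU sends and receives about 2 * (N - 1) / N of the message.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_bytes / link_bytes_per_sec


# Hypothetical numbers, for illustration only.
grad_bytes = 400e6  # e.g. 100M fp32 parameters
link_bw = 50e9      # assumed 50 GB/s per-GPU interconnect bandwidth

alone = ring_allreduce_seconds(grad_bytes, 8, link_bw)
# Assumption: two concurrent jobs each get half of the link bandwidth.
shared = ring_allreduce_seconds(grad_bytes, 8, link_bw / 2)

print(f"alone:  {alone * 1e3:.1f} ms per all-reduce")
print(f"shared: {shared * 1e3:.1f} ms per all-reduce")
```

Under these assumptions the all-reduce time doubles when the link is shared. If the jobs also share the same GPUs, they compete for compute as well, which this estimate does not capture.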

Are you noticing any extreme slowness/hangs that may be more indicative of a bug?

Hi @rvarm1, thanks for the reply. I'm not sure whether it is a bug: the program doesn't hang, but the slowdown is significant. Reducing the sync frequency speeds things up, so I suspect the communication between GPUs is the bottleneck.
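The effect of sync frequency can be sketched with a simple amortization model. This is only an illustration: the compute and all-reduce times are hypothetical, and the model assumes communication is not overlapped with computation, so it is an upper bound on what syncing less often can save.

```python
def avg_step_seconds(compute_s, allreduce_s, sync_every):
    """Average time per step when one all-reduce runs every `sync_every` steps.

    Assumes communication is not overlapped with compute.
    """
    if sync_every < 1:
        raise ValueError("sync_every must be >= 1")
    return compute_s + allreduce_s / sync_every


# Hypothetical timings, for illustration only.
compute_s = 0.050    # forward + backward per step
allreduce_s = 0.028  # one all-reduce while the link is contended

for k in (1, 2, 4, 8):
    step_ms = avg_step_seconds(compute_s, allreduce_s, k) * 1e3
    print(f"sync every {k} step(s): {step_ms:.1f} ms per step")
```

If the jobs use PyTorch `DistributedDataParallel` (an assumption, since the thread does not say), its `no_sync()` context manager is one way to skip the gradient all-reduce on intermediate accumulation steps.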