Training with gloo gets slow across multiple nodes

I’m training with DDP on a Slurm cluster using CPUs and the gloo backend. I call torch.set_num_threads with the number of CPUs per process, and the Slurm ntasks equals the number of nodes, roughly as in the sketch below.
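For reference, a minimal sketch of that setup (assumptions: one Slurm task per node, rank and world size taken from Slurm environment variables, and a placeholder model; MASTER_ADDR/MASTER_PORT are exported in the sbatch script):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One Slurm task per node; rank/world size come from Slurm's environment.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])

    # Limit intra-op threads to the CPUs allocated to this task.
    torch.set_num_threads(int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))

    # CPU-only training over the gloo backend; MASTER_ADDR/MASTER_PORT
    # must already be set in the environment before launch.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(128, 10)   # placeholder model
    ddp_model = DDP(model)       # no device_ids for CPU training

    x = torch.randn(32, 128)     # placeholder batch
    loss = ddp_model(x).sum()
    loss.backward()              # gradients are all-reduced over gloo here

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```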

When I train on more than one node, training gets much slower and uses a lot of extra memory, which forces me to decrease the batch size per process. The result is a sizable performance loss, even though scaling within a single node is pretty good.

I’m wondering if anyone has any ideas why this happens and how I might improve it.

Well, using DDP will slow down each iteration compared with training on a single process without DDP. But I’m not sure whether it needs more memory.

It won’t slow down overall training, since you are able to process much more data in parallel and thus reduce the epoch time significantly.

Yeah, it will reduce the epoch time, but it increases the per-iteration time because each step also has to synchronize gradients over the network.