I’m training with DDP on a Slurm cluster using CPUs and the gloo backend. I call torch.set_num_threads() with the number of CPUs per process, and the number of Slurm tasks (--ntasks) equals the number of nodes, so there is one process per node.
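For reference, the setup looks roughly like this (a minimal sketch, not my actual training script; the model is a placeholder, and I’m assuming MASTER_ADDR/MASTER_PORT are exported from the sbatch script):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # One Slurm task per node; rank and world size come from Slurm env vars.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

    # Pin the intra-op thread pool to the CPUs allotted to this process.
    torch.set_num_threads(cpus_per_task)

    # env:// init; MASTER_ADDR / MASTER_PORT set in the sbatch script (assumption).
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(128, 10)  # placeholder for the real model
    ddp_model = DDP(model)            # CPU training, so no device_ids
    return ddp_model

if __name__ == "__main__":
    ddp_model = setup()
```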
When I train on more than one node, training gets much slower and uses a lot of extra memory, which forces me to decrease the per-process batch size. The result is a sizable performance loss, even though scaling within a single node is pretty good.
I’m wondering if anyone has ideas about why this happens and how I might improve it.