Hello,
I have a problem/question regarding training a PyTorch model on a cluster with multiple CPU nodes using SLURM. I saw that the easiest way would be to use DDP and torchrun, but all the examples I found were for multi-GPU training, not multi-CPU. I then adapted the GPU examples to work on CPUs (e.g., using the "gloo" backend instead of "nccl"). Unfortunately, I only got two outcomes, and neither worked:
- The code raised an error saying that there are no available CUDA devices. Since I am trying to run on multiple CPUs without any GPUs, this error shouldn't come up at all. I found that I should not use any form of .to(rank) in the code, to keep it from automatically trying to use GPUs. However, this led to the second problem.
- Without any .to(rank), the SLURM job assigns multiple CPU nodes, but the training still runs on only one of them.
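For context, here is a minimal sketch of my adapted setup (the model, data, and hyperparameters are just placeholders, and this version spawns two processes on a single node for simplicity):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # Rendezvous settings for a single-node test run (placeholders).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # CPU-only training: use the "gloo" backend, not "nccl".
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)   # placeholder model, stays on the CPU
    ddp_model = DDP(model)           # no device_ids -> CPU DDP
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(5):               # toy training loop
        x = torch.randn(8, 10)       # note: no .to(rank), tensors stay on CPU
        loss = ddp_model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()              # gradients are all-reduced via gloo
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2                   # two CPU worker processes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

With torchrun instead of mp.spawn, I would call dist.init_process_group("gloo") with no arguments, since torchrun sets RANK and WORLD_SIZE in the environment.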
I am now at a point where I don't know what to do anymore. I need to speed up my training across multiple CPU nodes because I don't have a GPU available, but the only options I find are DDP and torchrun, which don't work for me with CPUs only. It is also very possible that I am doing something wrong, but I don't know what. Does anyone know an alternative or a guide for training on multiple CPU nodes? I would be really thankful for any help I could get. It is kind of urgent.
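For completeness, this is roughly how I launch the job with SLURM and torchrun (node counts, CPU counts, the port, and the script name train.py are all placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=ddp-cpu
#SBATCH --nodes=2                  # placeholder: two CPU nodes
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --cpus-per-task=8          # placeholder

# Use the first node in the allocation as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    train.py
```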
Thank you in advance for your replies.