How to train PyTorch model on multiple CPU nodes (SLURM)?

Hello,

I have a problem/question regarding training a PyTorch model on a cluster with multiple CPU nodes using SLURM. I saw that the easiest way would be to use DDP and torchrun, but all the examples I found were for multi-GPU training, not multi-CPU. I then took the GPU examples and adapted them to run on CPUs (e.g., using “gloo” instead of “nccl”). Unfortunately, both outcomes I got didn’t work:

  1. The code raised an error saying that no CUDA devices are available. Since I am trying to run on CPUs only, without any GPUs, this error shouldn’t come up at all. I read that I should avoid any form of .to(rank) inside the code to prevent it from automatically trying to use GPUs. However, this led to the second problem.

  2. If no .to(rank) is used, the SLURM job allocates multiple CPU nodes, but the training still runs on only one of them.

I am now at a point where I don’t know what to do anymore. I need to speed up my training across multiple CPUs because I don’t have a GPU available, but the only options I find are DDP and torchrun, which don’t work for me with CPUs only. It is also very possible that I am doing something wrong, but I don’t know what. Does anyone know of an alternative, or a guide for training on multiple CPU nodes? I would be really thankful for any help I could get. It is kind of urgent.

Thank you in advance for your replies.

It should be possible to instantiate torch.nn.parallel.DistributedDataParallel with device_ids=None, and then simply never call .to on anything.
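A minimal sketch of what that could look like — the model, data, and hyperparameters are placeholders I made up; the key points are backend="gloo", device_ids=None, and never calling .to(rank):

```python
# CPU-only DDP sketch. Under torchrun, RANK / WORLD_SIZE / MASTER_ADDR /
# MASTER_PORT are set for you; the defaults below only allow a
# single-process smoke test without torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # "gloo" works on CPU; "nccl" requires GPUs.
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(10, 1)           # stays on CPU: no .to(rank)
    ddp_model = DDP(model, device_ids=None)  # None = CPU (or module's device)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    loss = None
    for _ in range(5):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across ranks here
        opt.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print(main())
```

In a real multi-node run you would also wrap your dataset in a DistributedSampler so each rank sees a different shard of the data.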
For connecting with SLURM, you should likely use srun ... torchrun ... within your sbatch script; there are examples for this here: Distributed training on slurm cluster

You can then just pass the relevant arguments to torchrun from the sbatch script for each node.
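As a rough illustration of that wiring — node counts, CPU counts, the rendezvous port, and the script name train.py are hypothetical and need adjusting to your cluster:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8

# Use the first allocated node as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# One torchrun launcher per node; torchrun then spawns the worker
# process(es) and sets RANK/WORLD_SIZE for each of them.
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py
```

You can raise --nproc_per_node to run several CPU workers per node, as long as the total matches what your training script expects.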