Hi, I am following the distributed data parallel tutorial for training on multiple GPUs. I am running the code on an HPC server, requesting 2 GPUs on a single node.
I have managed to successfully run the code from the tutorial here, Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.6.0+cu124 documentation, using the mp.spawn syntax.
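For context, the mp.spawn version that works for me follows the tutorial closely. This is a simplified sketch of the setup rather than my exact script (the model and training loop are placeholders):

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # rank and world_size are passed explicitly when launching with mp.spawn
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = nn.Linear(10, 5).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
```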
However, when I launch the same script with torchrun instead, I get the error below:
Here is the command I use to launch it:
torchrun --nnodes=1 --nproc_per_node=2 --standalone distributed_tutorial_script.py
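For reference, my understanding is that torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the environment, so the script reads those instead of calling mp.spawn. A minimal sketch of how I adapted the init for the torchrun launch (again simplified, not my exact script):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT via the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 5).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```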
Any help, hints, or pointers are very much appreciated.
Thanks!