PyTorch distributed: multiple runs on the same server - "server socket has failed" error

I have a server with 8 GPUs available. I want to launch multiple training runs, each using 2 GPUs. After starting the first run, when I try to start the second one with the following command:

torchrun --standalone --nproc_per_node=2 train.py --batch_size 16
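(The first run was started the same way. Roughly speaking, the two launches look like the sketch below; the CUDA_VISIBLE_DEVICES values are only an illustration, the point is just that each run is meant to end up on its own pair of GPUs.)

# first run, e.g. on GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 train.py --batch_size 16

# second run, e.g. on GPUs 2 and 3 (this is the one that fails)
CUDA_VISIBLE_DEVICES=2,3 torchrun --standalone --nproc_per_node=2 train.py --batch_size 16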

I am getting the following error:

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:29400 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:29400 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

My understanding is that this happens because both runs are trying to use the same port. I tried setting a different port with --master_port 29501, and I also tried setting the port inside the script like this:

import os
# set the rendezvous address and port before torch.distributed is initialized
os.environ['MASTER_PORT'] = "29500"
os.environ['MASTER_ADDR'] = '127.0.0.1'

but neither of these worked.
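For reference, the attempt with an explicit port looked roughly like this (the flag placement is from memory):

torchrun --standalone --nproc_per_node=2 --master_port 29501 train.py --batch_size 16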