How to solve "RuntimeError: Address already in use" in PyTorch distributed training?

During PyTorch distributed training, I got the following RuntimeError:

Traceback (most recent call last):
  File "visual/distribution_train.py", line 387, in <module>
    main()
  File "visual/distribution_train.py", line 67, in main
    spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/public/home/fengm/.conda/envs/fm_pytorch_env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/public/home/fengm/.conda/envs/fm_pytorch_env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/public/home/fengm/.conda/envs/fm_pytorch_env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/public/home/fengm/vehicle_reid/pytorch-pose-master/visual/distribution_train.py", line 74, in main_worker
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,world_size=args.world_size, rank=args.rank)
  File "/public/home/fengm/.conda/envs/fm_pytorch_env/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/public/home/fengm/.conda/envs/fm_pytorch_env/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use

My PyTorch distributed initialization is:

torch.multiprocessing.spawn(main_worker, nprocs=8, args=(8, args))
torch.distributed.init_process_group(backend='nccl', init_method='tcp://110.2.1.101:8900',world_size=4, rank=0)

The cluster has 10 GPU nodes under a master node, and the master node itself has no GPU. I submit my task through Slurm, and it is assigned to a worker node at random. The address 110.2.1.101 in init_method is the master node's IP. I don't know whether the init_method is wrong.
Has anyone run into this before? Can anyone help me fix it?

What do you run in main_worker and where do the world_size=4 and rank=0 arguments to init_process_group come from? Are they hard coded, or do you list a single example?

The error itself means that multiple processes are trying to bind to the same address and port, so I assume you are trying to run multiple processes with rank=0.
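To illustrate what I mean, here is a minimal sketch of giving each spawned process its own rank. It assumes that main_worker receives the process index from spawn as its first argument (the fn(i, *args) call in your traceback) and that every node has the same number of GPUs. The helper names global_rank and total_world_size, and the node_rank value, are hypothetical and not taken from your script.

```python
def global_rank(node_rank, ngpus_per_node, local_gpu):
    """Return a rank that is unique across all nodes and GPUs.

    node_rank: index of this machine among the nodes in the job (assumed).
    local_gpu: process index that spawn passes to main_worker.
    """
    return node_rank * ngpus_per_node + local_gpu


def total_world_size(num_nodes, ngpus_per_node):
    """Total number of processes taking part in the job."""
    return num_nodes * ngpus_per_node


if __name__ == "__main__":
    # Example: one node with 8 GPUs, matching nprocs=8 in the question.
    ngpus_per_node = 8
    ranks = [global_rank(0, ngpus_per_node, gpu) for gpu in range(ngpus_per_node)]
    print(ranks)                                # [0, 1, 2, 3, 4, 5, 6, 7]
    print(total_world_size(1, ngpus_per_node))  # 8
```

Inside main_worker you would then pass these two values as rank and world_size to init_process_group instead of fixed numbers, so that only one process ends up with rank 0.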

Sorry to bring this back up. I ran into the same issue. Did you manage to solve it?

You can set a free port in init_method, i.e. one that no other process on that machine is already using.
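One way to find such a port, as a rough sketch: ask the operating system for an unused port by binding to port 0. The helpers find_free_port and build_init_method are hypothetical names, and the sketch assumes the port is chosen once on the machine where the rank 0 process runs, before the workers are spawned, and then passed to every worker so they all use the same URL.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def build_init_method(host, port):
    """Build a tcp:// URL in the form init_process_group accepts."""
    return "tcp://{}:{}".format(host, port)


if __name__ == "__main__":
    port = find_free_port()
    # 127.0.0.1 is only a placeholder for a single-machine run.
    print(build_init_method("127.0.0.1", port))
```

The port is released again as soon as the socket is closed, so another process could still grab it before training starts; treat this as a convenience, not a guarantee. Also note that the rank 0 process is the one that binds to the port, so the host in the URL should be the address of the machine where rank 0 actually runs.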