How to tackle "RuntimeError: Address already in use"

While training a Transformer model in PyTorch, I hit this RuntimeError:

RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch_1532581333611/work/torch/lib/THD/process_group/General.cpp:17

Has anyone seen this before?

I am confused by this error.


The error seems to happen in the distributed package.
Could you give a small code sample to reproduce this, please?

Hi, I have fixed this error. It happened because the TCP port of my distributed setup was hardcoded, so a second multi-GPU task tried to bind the same port. I now use a different port for each task. Thanks for your advice @albanD.
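For context, here is a minimal sketch (plain sockets, no PyTorch) of why a hardcoded port fails: the distributed TCP rendezvous binds a listening socket on the master port, and a second job binding the same port gets EADDRINUSE, which surfaces as "Address already in use".

```python
import errno
import socket

# First "job": bind and listen on a port (port 0 lets the OS pick a free one).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
port = first.getsockname()[1]
first.listen(1)

# Second "job": try to bind the exact same port, as a hardcoded
# master_port would. This fails with EADDRINUSE.
caught = False
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    caught = (e.errno == errno.EADDRINUSE)
finally:
    second.close()
    first.close()

print("second bind raised EADDRINUSE:", caught)
```

So any fix amounts to making sure each concurrent job uses a distinct master port.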

Hi, did you use nvidia-docker with multiple nodes? If so, how do you set the master_addr and master_port used by torch.distributed.launch? I would really appreciate a small code sample. Thanks in advance.


Hello! How do I set the port? I have this error too. Thank you.

@maomaochongchh Maybe you can use the command "python -m torch.distributed.launch --master_port () --nproc_per_node=1 …"
In () you can give any free port number.
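Rather than guessing a random number, you can ask the OS for a currently free port before launching. This is a hedged sketch; find_free_port is a hypothetical helper, not part of PyTorch, and there is a small race window between picking the port and the launcher binding it.

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the kernel assigns an unused TCP port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
# You could then pass it along, e.g.:
#   python -m torch.distributed.launch --master_port <port> --nproc_per_node=1 ...
print("suggested --master_port:", port)
```

Alternatively, exporting MASTER_PORT in the environment before launch achieves the same effect for torch.distributed initialization.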