How to tackle "RuntimeError address already in use"


(Liam) #1

While training a Transformer model in PyTorch, I hit this RuntimeError:

RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch_1532581333611/work/torch/lib/THD/process_group/General.cpp:17

Has anyone run into it before?


(Liam) #2

I am confused by this error :slightly_frowning_face:


(Alban D) #3

Hi,

The error seems to happen in the distributed package.
Could you give a small code sample to reproduce this, please?


(Liam) #4

Hi, I have fixed this error. My distributed setup's TCP port was hardcoded, so it collided when I launched another multi-GPU task; using a different port for the second task solved it. Thanks for your advice @albanD
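A minimal sketch of one way to avoid a hardcoded port: ask the OS for a free TCP port before initializing the process group. The `init_process_group` call at the end is commented out and assumes a single-node `env://` setup; none of this is from the thread, just an illustration:

```python
import os
import socket

def find_free_port():
    # Bind to port 0: the OS assigns an unused TCP port for us
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
os.environ["MASTER_ADDR"] = "127.0.0.1"   # single-node assumption
os.environ["MASTER_PORT"] = str(port)

# Then, with PyTorch installed (sketch only):
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", init_method="env://",
#                         rank=0, world_size=1)
```

Note the port is released when `find_free_port` returns, so a rare race with another process is still possible; it just makes collisions much less likely than a hardcoded value.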


(Johnny Law) #5

Hi, did you use nvidia-docker with multiple nodes? If so, how do you set the master_addr and master_port used by torch.distributed.launch? I would really appreciate a small code sample. Thanks in advance.


(Crystal) #6

Hello! How do I set the port? I get this error too. Thank you.


#7

@maomaochongchh Maybe you can use the command `python -m torch.distributed.launch --master_port () --nproc_per_node=1 …`
In `()` you can give a free port number.
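For example (a sketch, not from the thread: the port 29501 is an arbitrary choice and `127.0.0.1` assumes a single node), `torch.distributed.launch --master_port` works by setting the `MASTER_PORT` environment variable for the worker processes, so you can also set it yourself before initializing:

```python
import os

# Hypothetical port choice; any free TCP port above 1024 works
os.environ["MASTER_PORT"] = "29501"
os.environ["MASTER_ADDR"] = "127.0.0.1"  # single-node assumption

# torch.distributed.launch sets these for you when invoked as, e.g.:
#   python -m torch.distributed.launch --master_port 29501 --nproc_per_node=1 train.py
```

On multiple nodes, `MASTER_ADDR` would instead be the reachable address of the rank-0 node, and every node must use the same address/port pair.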