While training a Transformer model in PyTorch, I ran into this RuntimeError:
RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch_1532581333611/work/torch/lib/THD/process_group/General.cpp:17
Has anyone seen this before? I am confused by this error. It seems to come from the distributed package.
Could you give a small code sample to reproduce this please?
Hi, I have fixed this error. It happened because the TCP port in my distributed setup was hardcoded, and another multi-GPU task was already using it; I switched that task to a different port. Thanks for your advice @albanD
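One way to avoid the hardcoded-port collision described above is to read the rendezvous address and port from environment variables. This is a minimal sketch, not the original poster's code; `MASTER_ADDR`/`MASTER_PORT` are the variable names PyTorch's launcher conventionally uses, and 29500 is its usual default port:

```python
import os

# Read the rendezvous address/port from the environment instead of
# hardcoding them, so two multi-GPU jobs on one machine don't collide.
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = os.environ.get("MASTER_PORT", "29500")  # 29500 is the usual default

init_method = f"tcp://{master_addr}:{master_port}"
# This string can then be passed to
# torch.distributed.init_process_group(backend=..., init_method=init_method)
print(init_method)
```

Launching a second job with `MASTER_PORT=29501 python train.py` (for example) would then use a different port without touching the code.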
Hi, did you use nvidia-docker with multiple nodes? If you did, how do you set the master_addr used in torch.distributed.launch? I would really appreciate a small code sample. Thanks in advance.
Hello! How do I set the port? I have this error too. Thank you.
@maomaochongchh Maybe you can use the command `python -m torch.distributed.launch --master_port () --nproc_per_node=1 …`
In `()` you can give a random port number.
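If you would rather not guess a port number, a small helper like the following (my own sketch, not part of PyTorch) asks the OS for a free TCP port, which you can then pass to `--master_port`:

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))            # port 0 means "any free port"
        return s.getsockname()[1]  # the port the OS actually assigned

print(find_free_port())
```

Note there is a small race: the port is released when the socket closes, so in principle another process could grab it before your launch command runs, but in practice this is a common and reliable trick.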