How to tackle "RuntimeError: Address already in use"

While training a Transformer model in PyTorch, I hit this RuntimeError:

RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch_1532581333611/work/torch/lib/THD/process_group/General.cpp:17

Has anyone seen this before?

I am confused by this error.


The error seems to happen in the distributed package.
Could you give a small code sample to reproduce this, please?

Hi, I have fixed this error. It happened because the TCP port of my distributed setup was hardcoded, so a second multi-GPU task tried to bind the same port. I now use a different port for each task. Thanks for your advice @albanD.
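For context, here is a minimal sketch (plain sockets, no PyTorch) of why a hardcoded port fails: the distributed TCP rendezvous binds a listening socket on the master port, and a second job binding the same port gets EADDRINUSE, which surfaces as "Address already in use".

```python
import errno
import socket

# First "job": bind and listen on a port (port 0 lets the OS pick a free one).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
port = first.getsockname()[1]
first.listen(1)

# Second "job": try to bind the exact same port, as a hardcoded
# master_port would. This fails with EADDRINUSE.
caught = False
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    caught = (e.errno == errno.EADDRINUSE)
finally:
    second.close()
    first.close()

print("second bind raised EADDRINUSE:", caught)
```

So any fix amounts to making sure each concurrent job uses a distinct master port.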

Hi, did you use nvidia-docker with multiple nodes? If so, how do you set the master_addr and master_port used by torch.distributed.launch? I would really appreciate a small code sample. Thanks in advance.


Hello! How do I set the port? I have this error too. Thank you.

@maomaochongchh Maybe you can use the command "python -m torch.distributed.launch --master_port () --nproc_per_node=1 …"
In () you can give any free port number.
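Rather than guessing a random number, you can ask the OS for a currently free port before launching. This is a hedged sketch; find_free_port is a hypothetical helper, not part of PyTorch, and there is a small race window between picking the port and the launcher binding it.

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the kernel assigns an unused TCP port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
# You could then pass it along, e.g.:
#   python -m torch.distributed.launch --master_port <port> --nproc_per_node=1 ...
print("suggested --master_port:", port)
```

Alternatively, exporting MASTER_PORT in the environment before launch achieves the same effect for torch.distributed initialization.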