Launching a job for DistributedDataParallel using torch.distributed.launch works fine the first time. On the second launch, I get RuntimeError: Address already in use.
I’ve tried modifying MASTER_ADDR, but then I get RuntimeError: Connection timed out. What is the proper way to make sure distributed jobs do not collide?
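For reference, here is a minimal sketch of my setup; the script name, process count, backend, and address are placeholders rather than my exact configuration:

```python
# Roughly how the job is launched (train.py and the process count are placeholders):
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
#
# Inside the training script, the process group is initialized from the
# environment variables that torch.distributed.launch sets
# (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE):
import os

import torch.distributed as dist

# This is what I mean by "modifying MASTER_ADDR" (the address is a placeholder):
os.environ["MASTER_ADDR"] = "127.0.0.1"

dist.init_process_group(backend="nccl", init_method="env://")
```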