Launching a job for DistributedDataParallel using torch.distributed.launch works fine the first time. On the second launch, I get RuntimeError: Address already in use.
I’ve tried modifying MASTER_ADDR, but then I get RuntimeError: Connection timed out. What is the proper way to make sure distributed jobs do not collide?
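For reference, here is a minimal sketch of my setup; the script name, process count, backend, and address are placeholders rather than my exact configuration:

```python
# Roughly how the job is launched (train.py and the process count are placeholders):
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
#
# Inside the training script, the process group is initialized from the
# environment variables that torch.distributed.launch sets
# (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE):
import os

import torch.distributed as dist

# This is what I mean by "modifying MASTER_ADDR" (the address is a placeholder):
os.environ["MASTER_ADDR"] = "127.0.0.1"

dist.init_process_group(backend="nccl", init_method="env://")
```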