Unit tests with DistributedDataParallel

Hi,

when I run unit tests with DistributedDataParallel components I always end up with the following exception:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.

This is probably the relevant piece of code:

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
tdist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

I tried different ports, but I always run into the same issue. Outside of unit tests it works.

Any ideas?

Best,
Thorsten

PS: I’m using the latest nightly

This can happen if the port is already in use. Often you launched the same script earlier and did not kill it completely, so it still occupies the port. Sometimes another user on the same machine is using the port. Either way, you can use ps -ef | grep python or htop to list your processes, look for command lines from your previous runs, and kill them. Alternatively, use netstat to find out what is holding the port.
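
If a hard-coded port keeps colliding between test runs, one option is to let the OS hand out a free port before the process group is initialized. The snippet below is only a sketch under a few assumptions: find_free_port and port_is_free are hypothetical helper names (not part of PyTorch), everything runs on 127.0.0.1, and the port is chosen once in the parent test process so that all ranks see the same MASTER_PORT. There is still a small window in which another process could grab the port after the helper releases it.

```python
import os
import socket


def find_free_port(host="127.0.0.1"):
    """Hypothetical helper: ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def port_is_free(port, host="127.0.0.1"):
    """Hypothetical helper: True if we can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:  # e.g. errno 98, "Address already in use"
            return False
        return True


# Set these before spawning the workers so every rank inherits the same values.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(find_free_port())
```

With the environment set this way, the init_process_group call from the question can stay unchanged. If several test cases run in the same process, it may also be worth checking that each one calls torch.distributed.destroy_process_group() in its teardown, since a process group left over from an earlier test is one possible reason for the port still being held.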