Hi,
When I run unit tests that use DistributedDataParallel components, I always end up with the following exception:
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
This is probably the relevant piece of code:
import os
import torch.distributed as tdist  # assumed alias; rank and world_size are set elsewhere
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
tdist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
I tried different ports, but I always run into the same issue. Outside of unit tests it works.
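To rule out a stale listener on a hard-coded port, a free port could be requested from the OS for each test run. This is only a minimal sketch: find_free_port is a hypothetical helper name (not part of PyTorch), and it assumes the port is chosen once in the parent test process before the workers are spawned, so that every rank sees the same MASTER_PORT.

```python
import os
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose a free ephemeral port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Choose the port once in the parent test process, before spawning workers,
# so that every rank reads the same value from the environment.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(find_free_port())
```

The socket is closed before the process group binds the port, so another process could in principle grab it in between. This would not help if the real cause is a process group left over from an earlier test that was never destroyed.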
Any ideas?
Best,
Thorsten
PS: I’m using the latest PyTorch nightly build.