Issue running distributed example

adifrancesco · February 8, 2019, 8:12pm

I am new to PyTorch and to distributed learning in general, and I’m trying to work through this tutorial: https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html. After setting everything up, when I run the 4 Python processes (2 on each machine), I always get the following error:

File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, start_daemon)
RuntimeError: Address already in use

I suspect this is related to the init_method I’m specifying. I’m using the rank 0 machine’s IP address and port for that value, as the tutorial specifies. Nothing else is running on that port. Am I missing something about how to configure this properly?
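
For anyone debugging the same error, here is a minimal sketch of one way to test the "nothing else is running on that port" assumption from the rank 0 machine before launching any training process. It uses only the Python standard library and tries to bind the port, which is the operation that fails with "Address already in use". The helper name `port_is_free`, the host `127.0.0.1`, and the port `23456` are hypothetical placeholders, not values from the tutorial or from my setup.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # bind() raises OSError (EADDRINUSE) if another socket
            # already holds this address and port.
            s.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # Placeholder values: substitute the address and port used in init_method.
    host, port = "127.0.0.1", 23456
    state = "free" if port_is_free(host, port) else "already in use"
    print(f"{host}:{port} is {state}")
```

If the port is reported free before launch but the error still appears, one possibility worth checking (an assumption, not something the traceback confirms) is that more than one of the launched processes is trying to listen on the same port, for example because two of them were started with the same rank.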