Hi,
I am trying to use the distributed package with two nodes, but I am getting runtime errors.
I am using the PyTorch nightly build with Python 3. I have two scripts, one for the master and one for the slave (code: master, slave). I tried both the gloo and nccl backends and got the same errors.
I am getting the following error on the master:
Traceback (most recent call last):
File "s_testm.py", line 86, in <module>
main()
File "s_testm.py", line 83, in main
init_processes(rank,size,run,master_ip)
File "s_testm.py", line 57, in init_processes
dist.init_process_group(backend)
File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 280, in init_process_group
store, rank, world_size = next(rendezvous(init_method))
File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 131, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, start_daemon)
RuntimeError: Connection timed out
And the following error on the slave:
Traceback (most recent call last):
File "s_tests.py", line 76, in <module>
main()
File "s_tests.py", line 73, in main
init_processes(rank,size,run,master_ip)
File "s_tests.py", line 52, in init_processes
dist.init_process_group(backend)
File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 283, in init_process_group
_default_pg = ProcessGroupGloo(store, rank, world_size)
RuntimeError: Resource temporarily unavailable
I am using machines from Paperspace, and I am exchanging IP addresses between the master and the slave via their shared file system. I am not using file-system initialization because their file system does not support fcntl locking.
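For reference, my init_processes is roughly the following sketch (the exact variable names, the gloo default, and the helper split are simplifications of my actual code; master_ip is the address read from the shared file system):

```python
import os


def setup_env_rendezvous(rank, size, master_ip, port=29500):
    # The env:// init method (the default) reads these four variables
    # to rendezvous: rank 0 starts the TCPStore, the others connect.
    os.environ["MASTER_ADDR"] = master_ip
    os.environ["MASTER_PORT"] = str(port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(size)


def init_processes(rank, size, fn, master_ip, backend="gloo"):
    # Imported here so the sketch stays self-contained.
    import torch.distributed as dist

    setup_env_rendezvous(rank, size, master_ip)
    dist.init_process_group(backend)  # blocks until all ranks have joined
    fn(rank, size)
```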
To run the scripts, I am using their command-line Python client as follows:
#master
paperspace jobs create --project myproject --machineType P5000 --container paperspace/fastai:1.0-CUDA9.2-base --ports 29500:29500 --command 'apt-get update;apt-get install libibverbs1;echo "start";python3 s_testm.py' --workspace "./"
#slave
paperspace jobs create --project myproject --machineType P5000 --container paperspace/fastai:1.0-CUDA9.2-base --ports 29500:29500 --command 'apt-get update;apt-get install libibverbs1; python3 s_tests.py' --workspace "./"
As you can see, the machines have P5000 GPUs, I am using their fastai container with PyTorch 1.0, and I am opening port 29500 for initialization.
The container lacks the libibverbs1 library, so I install it before running my scripts.
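In case it helps with debugging, here is a minimal stdlib sketch one could run from the slave to check whether the master's port 29500 is reachable at all (the host below is a placeholder, not my actual master IP):

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example with a placeholder address:
# can_reach("10.0.0.1", 29500)
```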
I am open to all kinds of help. Thanks in advance.