RuntimeError with Distributed package using NCCL or GLOO backend

enisberk · November 30, 2018, 10:32pm

Hi,

I am trying to use the distributed package with two nodes, but I am getting runtime errors.
I am using the PyTorch nightly version with Python 3. I have two scripts, one for the master and one for the slave (code: master, slave). I tried both the gloo and nccl backends and got the same errors.

I am getting the following error on the master:

Traceback (most recent call last):
  File "s_testm.py", line 86, in <module>
    main()
  File "s_testm.py", line 83, in main
    init_processes(rank,size,run,master_ip)
  File "s_testm.py", line 57, in init_processes
    dist.init_process_group(backend)
  File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 280, in init_process_group
    store, rank, world_size = next(rendezvous(init_method))
  File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 131, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
RuntimeError: Connection timed out

And the following error on the slave:

Traceback (most recent call last):
  File "s_tests.py", line 76, in <module>
    main()
  File "s_tests.py", line 73, in main
    init_processes(rank,size,run,master_ip)
  File "s_tests.py", line 52, in init_processes
    dist.init_process_group(backend)
  File "/opt/conda/envs/fastai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 283, in init_process_group
    _default_pg = ProcessGroupGloo(store, rank, world_size)
RuntimeError: Resource temporarily unavailable

I am using machines from Paperspace, and I use their shared file system to exchange IP addresses between the master and the slave. I am not using file system initialization because their file system does not support fcntl locking.
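For reference, the tracebacks above go through `_env_rendezvous_handler`, which means the scripts rely on environment-variable initialization. A minimal sketch of that setup follows. It is not the poster's actual script: the helper names `rendezvous_env` and `init_from_env` are hypothetical, and port 29500 is taken from the commands in this post.

```python
import os


def rendezvous_env(master_ip, rank, world_size, port=29500):
    """Build the environment variables read by the env:// rendezvous.

    Hypothetical helper: master_ip would be the address the master
    published on the shared file system.
    """
    return {
        "MASTER_ADDR": str(master_ip),
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_from_env(master_ip, rank, world_size, backend="gloo", port=29500):
    """Export the variables, then join the process group (hypothetical helper)."""
    # Imported lazily so rendezvous_env can be used without PyTorch installed.
    import torch.distributed as dist

    os.environ.update(rendezvous_env(master_ip, rank, world_size, port))
    # Rank 0 hosts the TCPStore on MASTER_ADDR:MASTER_PORT; other ranks connect to it.
    dist.init_process_group(backend, init_method="env://")
```

On the master this would be called with rank 0, and on the slave with rank 1, both with a world size of 2. Every rank must be able to reach `MASTER_ADDR:MASTER_PORT`, otherwise the call blocks until it times out.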

To run the code, I am using their command-line Python client as follows:

#master
paperspace jobs create --project myproject --machineType P5000 --container paperspace/fastai:1.0-CUDA9.2-base --ports 29500:29500 --command 'apt-get update;apt-get install libibverbs1;echo "start";python3 s_testm.py' --workspace "./"

#slave
paperspace jobs create --project myproject --machineType P5000 --container paperspace/fastai:1.0-CUDA9.2-base --ports 29500:29500 --command 'apt-get update;apt-get install libibverbs1; python3 s_tests.py' --workspace "./"

As you can see, the machines have P5000 GPUs, I am using their fastai container with PyTorch 1.0, and I am opening port 29500 to use for initialization.
The container lacks the libibverbs1 library, so I install it before running my scripts.

I am open to all kinds of help, thanks in advance.

zkzhu0110 · December 21, 2018, 7:55am

I ran into the same problem. I think it may be related to how PyTorch communicates between different containers.

Shirley_Han · October 9, 2019, 8:02am

Hi,

Did you solve this problem?

I also ran into an error related to TCPStore.

gradwolf · November 12, 2019, 12:39am

Same issue here.

I am trying to run a distributed training job on a single node with 4 GPUs. I have set the IP address of my instance and the port number, but I am getting this error. Any solutions?

  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out

enisberk · November 12, 2019, 8:56pm

It was a network setup issue related to Docker, so we could not solve it. You need to make sure the nodes/threads can communicate over the network.
Paperspace solved this problem in their latest version by changing their networking setup.
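
One way to check the point about network communication before calling `init_process_group` is a plain TCP connection test from a worker to the master's rendezvous port. This is a minimal sketch using only the standard library; the function name `can_connect` and the 5-second timeout are assumptions for illustration, not part of PyTorch.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

The check only succeeds while something is listening on that port, for example while the rank 0 process is already waiting inside `init_process_group`. If it returns False in that situation, the "Connection timed out" error from TCPStore is likely a firewall, port mapping, or container networking problem rather than a PyTorch one. Reaching the rendezvous port is necessary but may not be sufficient, since the backends can open additional connections between processes.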