Runtime error using Distributed

Hello everyone,

I was trying to use the PyTorch distributed package; however, I came across the following error:

Traceback (most recent call last):
  File "train_parallel_ch_classifier.py", line 385, in <module>
    main(args)
  File "train_parallel_ch_classifier.py", line 35, in main
    world_size  = args.world_size)
  File "/z/sw/packages/pytorch/0.2.0/lib/python2.7/site-packages/torch/distributed/__init__.py", line 46, in init_process_group
    group_name, rank)
RuntimeError: world_size was not set in config at /z/tmp/build/pytorch-0.2.0/torch/lib/THD/process_group/General.cpp:17

I am using Python 2.7, with PyTorch 0.2 installed from source. Below is how I initialize the process group:

dist.init_process_group(backend     = 'gloo',
                        init_method = '/z/home/mbanani/nonexistant',
                        world_size  = args.world_size)

Any thoughts on what may be causing this or how I can fix it?

Thank you


I’m not sure what the exact issue is; can you post a fuller example to run?
Alternatively, read our new distributed tutorial here: http://pytorch.org/tutorials/intermediate/dist_tuto.html to see if it helps.

For shared file initialization, you need to specify ‘file:///z/home/mbanani/nonexistant’ in init_method.
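To make that concrete, here is a sketch of the corrected call, keeping the same path and world_size from the original post. Untested on 0.2, but the file:// scheme is how init_process_group recognizes shared-file initialization; without it, the init method cannot be parsed and world_size never makes it into the config.

import torch.distributed as dist

# Shared-file initialization: the path must carry the file:// scheme
# (three slashes: file:// plus the absolute path /z/home/...).
dist.init_process_group(backend     = 'gloo',
                        init_method = 'file:///z/home/mbanani/nonexistant',
                        world_size  = args.world_size)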


I had the same error. How did you solve the problem?