Error when using DDP on multiple GPUs


When I trained on 4 GPUs like this:
python -m torch.distributed.launch --nproc_per_node=4 .
there was an error:

Traceback (most recent call last):
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/", line 85, in _run_code
    exec(code, run_globals)
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/site-packages/torch/distributed/", line 263, in <module>
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/site-packages/torch/distributed/", line 259, in main
subprocess.CalledProcessError: Command '['/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/bin/python', '-u', '', '--local_rank=3']' returned non-zero exit status 1.

Could anyone tell me what happened?

Can you share a minimal repro, in particular how you call init_process_group and the DistributedDataParallel constructor?
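For reference, a typical single-node setup with torch.distributed.launch looks roughly like the sketch below. This is only an illustrative outline, not your code: the launcher passes --local_rank=&lt;n&gt; to each worker process it spawns (which is why the failing subprocess in your traceback shows --local_rank=3), and the tiny nn.Linear model here is just a stand-in.

```python
# Minimal DDP worker sketch for use with:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
# (script name "train.py" is hypothetical)
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # torch.distributed.launch injects this flag into every worker.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser.parse_args(argv)


def main():
    args = parse_args()
    # One process per GPU: pin this process to its device first.
    torch.cuda.set_device(args.local_rank)
    # The launcher sets MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE,
    # so env:// initialization picks them up automatically.
    dist.init_process_group(backend="nccl", init_method="env://")
    model = nn.Linear(10, 10).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank],
                output_device=args.local_rank)
    # ... training loop goes here ...


if __name__ == "__main__":
    main()
```

If any worker raises before or during init_process_group (for example, a file-permission error while writing), the launcher reports the subprocess.CalledProcessError you saw.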

BTW, could you please add a “distributed” tag to future torch.distributed-related posts, so that the PyTorch distributed team can get back to you promptly?


Thank you very much! I finally found that the error occurred simply because I didn’t have permission to modify the file. Thanks again!
