Error when using DDP on multiple GPUs


When I train on 4 GPUs like this:
python -m torch.distributed.launch --nproc_per_node=4 train_net.py .
I get the following error:

Traceback (most recent call last):
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home_ex/tianhongtao/SW/anaconda3/envs/Hisense/bin/python', '-u', 'Run.py', '--local_rank=3']' returned non-zero exit status 1.

Could anyone tell me what happened?

Can you share a minimal repro of train_net.py, especially how you call init_process_group and the DistributedDataParallel constructor?
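Something along these lines would be enough (a minimal sketch, assuming a single node launched with torch.distributed.launch; the model is a placeholder):

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its GPU before creating any CUDA tensors.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = nn.Linear(10, 10).cuda(args.local_rank)  # placeholder model
    model = DDP(model, device_ids=[args.local_rank])

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()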

By the way, could you please add a “distributed” tag to future torch.distributed-related posts? That way the PyTorch distributed team can get back to you promptly.


Thank you very much! I finally found that the error occurred simply because I didn’t have permission to modify the file. Thanks again!
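For anyone who hits the same CalledProcessError, a quick check like the one below can confirm whether a permission problem is the cause (a minimal sketch; the path is a hypothetical placeholder for whatever file your script modifies):

import os

path = "checkpoint.pth"  # placeholder; substitute the file your script writes to
# If the file doesn't exist yet, check the directory it would be created in.
target = path if os.path.exists(path) else os.path.dirname(path) or "."
if not os.access(target, os.W_OK):
    raise PermissionError(f"No write permission for {target}")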
