I use the command line below to run the script (the PyTorch example from DALI):
python -m torch.distributed.launch --nproc_per_node=1 train_imagenet_with_dali.py -t
With nproc_per_node=1 it works, but with nproc_per_node=2 it fails with the following error:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
=> creating model 'resnet50'
=> creating model 'resnet50'
Traceback (most recent call last):
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zyy/anaconda3/envs/python36/bin/python', '-u', 'train_imagenet_with_dali.py', '--local_rank=1', '-t']' died with <Signals.SIGSEGV: 11>.
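
The traceback above comes from the launcher only, so it does not show where the worker with --local_rank=1 actually crashed. A minimal sketch for getting more detail, assuming the segfault happens inside the Python worker process: the standard-library faulthandler module can dump the Python stack when a fatal signal such as SIGSEGV arrives. The helper name enable_crash_traces is hypothetical and not part of the DALI example; the call would go at the very top of train_imagenet_with_dali.py.

```python
import faulthandler
import sys


def enable_crash_traces(stream=sys.stderr):
    """Hypothetical helper: ask Python to dump the stack of every thread
    to `stream` when the process receives a fatal signal (e.g. SIGSEGV)."""
    faulthandler.enable(file=stream, all_threads=True)
    # Report whether the handler is now active.
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Assumption: this runs before the rest of the training script,
    # so a crash in either worker prints its own Python traceback.
    enable_crash_traces()
```

The same effect should be available without editing the script by setting the environment variable PYTHONFAULTHANDLER=1 before the launch command. This only shows the Python frames active at the time of the crash; it does not by itself identify the cause.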