Something goes wrong when I try to use two GPUs in a node

zyyupup · September 29, 2019, 1:05pm

I use the command below to run the script (the PyTorch example from DALI):

python -m torch.distributed.launch --nproc_per_node=1 train_imagenet_with_dali.py -t

With nproc_per_node=1 it works, but with nproc_per_node=2 there is an error:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
=> creating model 'resnet50'
=> creating model 'resnet50'
Traceback (most recent call last):
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/home/zyy/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zyy/anaconda3/envs/python36/bin/python', '-u', 'train_imagenet_with_dali.py', '--local_rank=1', '-t']' died with <Signals.SIGSEGV: 11>.

spanev · September 29, 2019, 10:44pm

Hi @zyyupup,

Can you try this please:

python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 --dali_cpu --fp16 --b 32 --static-loss-scale 128.0 --workers 4 --lr=0.4 ./ 2>&1

Please also give us the full repro steps and your environment (Anaconda, PyTorch, and DALI versions).
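If it helps with collecting the versions, here is a minimal sketch that prints them in one go. It assumes the usual import names `torch` and `nvidia.dali`; the helper names `describe` and `environment_report` are made up for illustration and are not part of either library.

```python
import importlib
import platform


def describe(module_name):
    """Return the module's __version__, or a note if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:
        # A broken install can raise more than ImportError
        # (for example OSError on a missing shared library).
        return "not importable ({})".format(type(exc).__name__)
    return str(getattr(module, "__version__", "unknown"))


def environment_report():
    """Build a short text report to paste into a bug report."""
    lines = [
        "Python: " + platform.python_version(),
        "PyTorch: " + describe("torch"),      # assumed import name
        "DALI: " + describe("nvidia.dali"),   # assumed import name
        "System: " + platform.platform(),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Run it with the same Python interpreter that launches the training script, so the report reflects the environment that actually crashes.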

zyyupup · October 1, 2019, 10:57am

Thank you for your reply. I tried the command and the same error occurred again. My environment is:

Anaconda (Python): 3.6.8
PyTorch: 1.2.0+cuda9.2
DALI: 0.13.0
System: Ubuntu 16.04 with 2 GPUs

I only changed the path of the dataset in the code and then ran it.

JanuszL · October 1, 2019, 11:43am

The DALI example is based on an older version of the PyTorch APEX example - https://github.com/NVIDIA/apex/tree/master/examples/imagenet. You can try running that as well to check whether this is DALI's fault.
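Another way to narrow down where the segfault happens is Python's standard `faulthandler` module, which dumps the Python-level traceback when the process receives SIGSEGV. The sketch below is only a suggestion: the helper name `enable_crash_traces` and the `crash_rank<N>.log` file name are hypothetical, and the trace shows which Python line was executing at the time of the crash, not the native stack.

```python
import faulthandler
import os


def enable_crash_traces(local_rank, log_dir="."):
    """Write Python tracebacks to a per-rank file on SIGSEGV and similar signals."""
    # Hypothetical naming convention: one log file per launched process.
    path = os.path.join(log_dir, "crash_rank{}.log".format(local_rank))
    log_file = open(path, "w")
    # faulthandler keeps only the file descriptor, so the file must stay
    # open for the life of the process; return it so the caller holds it.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Assuming the training script parses the `--local_rank` argument that torch.distributed.launch passes (as seen in the failing command above) into something like `args.local_rank`, calling `crash_log = enable_crash_traces(args.local_rank)` early in the script should leave a file for the rank that died. If the last Python frame in that file is inside a DALI pipeline call, that points toward DALI; if it is elsewhere, the APEX example is likely to crash the same way.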