Does DDP Work With Cards in Exclusive Process Mode?

Hello, I have run into following errors on our cluster:

Traceback (most recent call last):
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 141, in <module>
    torch.multiprocessing.spawn(worker, nprocs=procinit['nprocs'], args=(procinit,))
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 84, in training_process
    net, device  = allocate_network_for_process(procinit['net'], pid, procinit['device'])
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 60, in allocate_network_for_process
    return allocate_network_for_process_on_gpu(net, pid)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 47, in allocate_network_for_process_on_gpu
    net.cuda(device)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 680, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 680, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The cards are allocated correctly, but it seems the issue is that they are set to Exclusive Process Mode. Can I use DDP with cards set to Exclusive Process Mode? Is there some workaround aside from resetting the cards' compute mode?

Many thanks

>>> import torch
>>> torch.__version__
'1.10.2'
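
A minimal sketch of one kind of workaround, assuming each spawned worker can be restricted to its own card by setting CUDA_VISIBLE_DEVICES before any CUDA call, so a worker never opens a context on a device that another process already owns. The worker function and the model below are illustrative placeholders, not the actual code in train_model.py:

import os
import torch
import torch.multiprocessing as mp


def worker(rank, nprocs):
    # Hide every GPU except the one assigned to this rank. This has to happen
    # before the first CUDA call in this process, otherwise a context may
    # already exist on a card that another process owns exclusively.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(rank)

    # From this worker's point of view there is now exactly one GPU, cuda:0.
    device = torch.device('cuda:0')
    net = torch.nn.Linear(8, 2).to(device)  # placeholder for the real model
    print(f'rank {rank} runs on {torch.cuda.get_device_name(device)}')


if __name__ == '__main__':
    nprocs = torch.cuda.device_count()
    mp.spawn(worker, nprocs=nprocs, args=(nprocs,))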

Unfortunately, I wasn’t able to reproduce the error above locally. Setting the GPU to Exclusive Process Mode seems to work fine, but I only have one GPU and one worker to try, so I am not sure whether the error only appears when multiple workers on multiple GPUs are launched.

In the case of two workers on two GPUs, the pstree looks like this:

$ pstree -Talp 934813
python,934813 -um src.deit_vis_loc.train_model --dataset-dir .git/input/dataset --metafile .git/input/queries_meta.json --train-params .git/input/train_params.json --output-dir .git/output --device cuda --workers 2
  ├─python,934908 -c from multiprocessing.semaphore_tracker import main;main(16)
  ├─python,934933 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=171) --multiprocessing-fork
  └─python,934937 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=17, pipe_handle=174) --multiprocessing-fork

According to the NVIDIA Developer Forums thread "nvidia-smi EXCLUSIVE_PROCESS" (CUDA Programming and Performance), exclusive process mode limits GPU usage to the single process that has already established a context on the device.
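
A quick way to confirm the compute mode of each card from Python, assuming the pynvml (nvidia-ml-py) package is available; `nvidia-smi -q -d COMPUTE` shows the same information:

import pynvml

# Query the compute mode of every GPU via NVML (the same source nvidia-smi uses).
pynvml.nvmlInit()
modes = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: 'Default',
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: 'Exclusive_Process',
    pynvml.NVML_COMPUTEMODE_PROHIBITED: 'Prohibited',
}
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    print(f'GPU {i}: {modes.get(mode, mode)}')
pynvml.nvmlShutdown()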

In this case it looks like module initialization (outside of DDP) is still going on during the crash, which seems to indicate that another process (perhaps a DDP training process) is already using the GPU. Could you check to ensure you are using a single distinct GPU per process and that no other processes on your machine are using the GPU?
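
For reference, the usual pattern for giving each process a single distinct GPU is to pin the device by rank before any other CUDA call and then pass that device to DDP. A minimal sketch, where the model, the master port, and the world size are placeholders rather than your actual setup:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Pin this process to its own GPU before anything touches CUDA, so the
    # process only ever creates a context on device `rank`.
    torch.cuda.set_device(rank)

    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    net = torch.nn.Linear(8, 2).cuda(rank)  # placeholder for the real model
    ddp_net = DDP(net, device_ids=[rank])

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, nprocs=world_size, args=(world_size,))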

Thank you for the reply! I can confirm that I am using a single GPU per process; launching with just one worker reproduces the error message too. It seems that there is an issue with our cluster where some resources hang on the graphics card:

  • Works locally in Exclusive Process Mode with one worker
  • Doesn’t work on our server in Exclusive Process Mode with one worker

I have tried two cudatoolkit versions (11.0 and 11.3) and the error still appears.
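
If the suspicion is that stale processes still hold contexts on the server's cards, something along these lines (again assuming pynvml is installed) can list the PIDs that currently own a compute context on each GPU:

import pynvml

# List every process that currently holds a compute context on each GPU;
# in Exclusive Process Mode a single stale entry is enough to block new jobs.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    pids = [p.pid for p in procs]
    print(f'GPU {i}: {pids or "no compute processes"}')
pynvml.nvmlShutdown()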