Hello, I have run into the following error on our cluster:
Traceback (most recent call last):
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 141, in <module>
    torch.multiprocessing.spawn(worker, nprocs=procinit['nprocs'], args=(procinit,))
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 84, in training_process
    net, device = allocate_network_for_process(procinit['net'], pid, procinit['device'])
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 60, in allocate_network_for_process
    return allocate_network_for_process_on_gpu(net, pid)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/src/deit_vis_loc/train_model.py", line 47, in allocate_network_for_process_on_gpu
    net.cuda(device)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 680, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/mnt/matylda1/Locate/cz.vutbr.fit.cphoto.deit-vis-loc/miniconda/envs/foo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 680, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
The cards are allocated correctly, but the issue seems to be that they are set to Exclusive Process mode. Can I use DDP with cards set to Exclusive Process mode? Is there some workaround aside from resetting the cards' compute mode?
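One workaround I am considering, in case resetting the mode is not an option: pinning each spawned worker to its own card before any other CUDA call. As far as I understand, Exclusive Process mode still allows one process per GPU, so the error would only appear when several workers initialize a CUDA context on the same device (typically device 0 by default). A minimal sketch of what I mean (nn.Linear stands in for my real network, and the worker body is simplified):

import torch
import torch.nn as nn

def worker(rank, nprocs):
    # Bind this process to its own GPU *before* any other CUDA call,
    # so that under EXCLUSIVE_PROCESS mode no two workers ever create
    # a context on the same card.
    torch.cuda.set_device(rank)
    device = torch.device('cuda', rank)
    net = nn.Linear(8, 8).to(device)  # stand-in for the real network
    # ... DDP setup and training loop would go here ...

if __name__ == '__main__':
    nprocs = torch.cuda.device_count()
    torch.multiprocessing.spawn(worker, nprocs=nprocs, args=(nprocs,))

(The current compute mode can be confirmed with nvidia-smi -q -d COMPUTE.) Would this be expected to work, or does DDP need something more?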
Many thanks
For reference, the PyTorch version in the environment:

>>> import torch
>>> torch.__version__
'1.10.2'