In the same environment, restricting the visible devices to a single GPU works fine (e.g. export CUDA_VISIBLE_DEVICES=0; python train.py …). Seems to error out only in DDP? I can get past the original error by listing at most 4 GPUs in CUDA_VISIBLE_DEVICES, but then I get a "CUDA error: an illegal memory access was encountered" error when running DDP on 2 or more GPUs.
Smoke test (another data point during debugging):
export CUDA_VISIBLE_DEVICES=0,1,2,3; python -c 'import torch; torch.cuda.is_available()'
works fine, but then adding more than 4 GPUs fails:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4; python -c 'import torch; torch.cuda.is_available()'
/home/adaboost/miniconda3/envs/mustango/lib/python3.10/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
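To narrow down whether one particular GPU or just the device count triggers the failure, the smoke test above could be repeated over several CUDA_VISIBLE_DEVICES settings, each in a fresh interpreter. This is only a rough sketch, not something from the runs above: the helper names (build_env, device_subsets, probe, report) are made up for illustration, it assumes torch is importable by the same Python that runs the script, and the number of GPUs passed to report is whatever the machine actually has.

```python
import os
import subprocess
import sys

# One-liner executed in each child process; same idea as the smoke test,
# but it prints the result so there is something to compare.
PROBE = "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"


def build_env(devices, base_env=None):
    """Copy the environment and set CUDA_VISIBLE_DEVICES to the given indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    return env


def device_subsets(n):
    """Each device on its own, then growing prefixes [0,1], [0,1,2], ..."""
    singles = [[i] for i in range(n)]
    prefixes = [list(range(k + 1)) for k in range(1, n)]
    return singles + prefixes


def probe(devices, python=sys.executable):
    """Run PROBE in a fresh interpreter; return (returncode, stdout, stderr)."""
    result = subprocess.run(
        [python, "-c", PROBE],
        env=build_env(devices),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


def report(n):
    """Print one line per device subset (n is the assumed number of GPUs)."""
    for devices in device_subsets(n):
        code, out, err = probe(devices)
        last_err = err.splitlines()[-1] if err else ""
        print(f"{build_env(devices, {})['CUDA_VISIBLE_DEVICES']:<20} "
              f"rc={code} out={out!r} {last_err}")
```

Calling report(8) on an 8-GPU box, for example, would show whether index 4 fails even when it is the only visible device (pointing at that card or its driver state) or only when combined with others (pointing at a limit on how many devices can be initialized together).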