Set CUDA_VISIBLE_DEVICES, Multiple MIGs, Single Job

J_W · February 12, 2023, 2:35am

I am attempting to use a package that employs PyTorch, and I keep getting errors when asking it to select GPUs 0 and 1, saying there are not that many GPUs.

CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1670525552411/work/aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

The script tries to set the GPUs via the following line of code, where gpu_list is '0,1':
os.environ['CUDA_VISIBLE_DEVICES'] = gpu_list  # gpu_list = '0,1'

I have noted that the cluster I am using has its GPUs partitioned into MIG instances. How could I set CUDA_VISIBLE_DEVICES to multiple MIGs for a single script? I found the following thread, but it applies to using MIGs in separate parallel jobs rather than together in one job: Access GPU partitions in MIG

ptrblck · February 12, 2023, 3:12am

You cannot use multiple MIG instances in the same process and need to select one via CUDA_VISIBLE_DEVICES.
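As a minimal sketch of selecting a single MIG instance, the snippet below reads the device listing from `nvidia-smi -L` and exports the first MIG UUID it finds. The helper names `parse_mig_uuids` and `select_single_mig` are made up for illustration, and the regular expression assumes the listing prints each instance as `(UUID: MIG-...)`, which may differ between driver versions.

```python
import os
import re
import subprocess
from typing import List


def parse_mig_uuids(listing: str) -> List[str]:
    """Return the MIG identifiers found in `nvidia-smi -L`-style text.

    Assumption: each MIG instance is printed as "(UUID: MIG-...)".
    """
    return re.findall(r"UUID:\s*(MIG-[^\s)]+)", listing)


def select_single_mig(index: int = 0) -> str:
    """Expose exactly one MIG instance to this process (hypothetical helper)."""
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    uuids = parse_mig_uuids(listing)
    if not uuids:
        raise RuntimeError("no MIG instances found in the nvidia-smi listing")
    # Must be set before the first CUDA call in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = uuids[index]
    return uuids[index]
```

Call `select_single_mig()` before the first CUDA call (ideally before `import torch`). The selected instance should then be the only visible device, so the script has to ask for device 0 only rather than '0,1'.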

J_W · February 12, 2023, 3:25am

Is it possible to select the two GPUs rather than two MIGs, or would that require removing the multi-instancing?

ptrblck · February 12, 2023, 3:44am

I think you would need to disable MIG to select two (full) GPUs.

vkalidas · May 22, 2023, 6:31pm

Hi @ptrblck, will multiple MIGs be supported in future PyTorch versions? Are there any other alternatives with H100 GPUs?

ptrblck · May 22, 2023, 7:38pm

I’m not aware of any plans to relax the limitation. Note that it’s a CUDA limitation, not a PyTorch one.
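Since the restriction is per process, one workaround for independent workloads is to start one process per MIG instance, each with its own CUDA_VISIBLE_DEVICES. The sketch below is only an illustration of that idea: `env_for_device` and `launch_one_per_device` are hypothetical helpers, the device IDs are placeholders, and it does not make the instances cooperate on a single job.

```python
import os
import subprocess
import sys
from typing import Dict, List


def env_for_device(device_id: str) -> Dict[str, str]:
    """Copy the current environment, exposing only one device to the child."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = device_id
    return env


def launch_one_per_device(device_ids: List[str], command: List[str]) -> List[int]:
    """Start `command` once per device ID and return the exit codes."""
    procs = [
        subprocess.Popen(command, env=env_for_device(device_id))
        for device_id in device_ids
    ]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Placeholder IDs; real ones would come from `nvidia-smi -L`.
    show_env = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    launch_one_per_device(["MIG-placeholder-0", "MIG-placeholder-1"], show_env)
```

Each child process sees only its own instance as device 0, so the training script itself does not need to know about MIG.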