Net.cuda() cannot access MIG instances

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-744fbd6b-59e0-0daa-223e-19ac51400ee0)
  MIG 4g.20gb     Device  0: (UUID: MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5)
  MIG 2g.10gb     Device  1: (UUID: MIG-c170d227-b149-5a1e-85b4-706e305dff4b)

To run on MIG 4g.20gb:

CUDA_VISIBLE_DEVICES=MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5 python

To run on MIG 2g.10gb:

CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python
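The UUIDs in the commands above are copied by hand from the device listing, which appears to be the output of nvidia-smi -L. As a minimal sketch, the listing could be parsed so the UUIDs do not have to be retyped. The function name parse_mig_uuids and the regular expression are illustrative assumptions based only on the three lines shown in this post; other driver versions may print a different format.

```python
import re

# Sample text copied from the device listing in this post.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-744fbd6b-59e0-0daa-223e-19ac51400ee0)
  MIG 4g.20gb     Device  0: (UUID: MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5)
  MIG 2g.10gb     Device  1: (UUID: MIG-c170d227-b149-5a1e-85b4-706e305dff4b)
"""

# Assumed line shape: "MIG <profile> Device <n>: (UUID: MIG-...)".
_MIG_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+\d+:\s+\(UUID:\s+(MIG-[^)\s]+)\)",
    re.MULTILINE,
)


def parse_mig_uuids(listing):
    """Return a list of (profile, uuid) pairs for every MIG line in the listing."""
    return _MIG_LINE.findall(listing)


for profile, uuid in parse_mig_uuids(SAMPLE_LISTING):
    # Print the command prefix used in this post for each instance.
    print(f"{profile}: CUDA_VISIBLE_DEVICES={uuid} python")
```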

However, the second command fails with the error RuntimeError: No CUDA GPUs are available

Also, torch.cuda.device_count() returns 1, but I expected 2.

Is this a bug?


You won’t be able to use multiple MIG devices in a single script, so that’s expected.
Could you post the exact MIG commands you’ve used which would reproduce that the second device cannot be used?
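Since a single script sees only one MIG device, one way to use both instances is to start one process per instance, each with its own CUDA_VISIBLE_DEVICES. The following is a rough sketch of that idea, not a tested recipe for this setup. The helper name launch_per_instance is hypothetical, and the command passed in is whatever script you would normally run.

```python
import os
import subprocess
import sys


def launch_per_instance(mig_uuids, argv):
    """Start argv once per MIG UUID and return {uuid: (returncode, stdout)}.

    Each child gets a copy of the current environment with
    CUDA_VISIBLE_DEVICES set to exactly one MIG UUID.
    """
    procs = {}
    for uuid in mig_uuids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
        procs[uuid] = subprocess.Popen(
            argv, env=env, stdout=subprocess.PIPE, text=True
        )

    results = {}
    for uuid, proc in procs.items():
        # Children run concurrently; output is collected one child at a time.
        out, _ = proc.communicate()
        results[uuid] = (proc.returncode, out)
    return results


# A stand-in command that only reports which device the child was given.
PROBE = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
```

Calling launch_per_instance with the two UUIDs from the first post and PROBE replaced by something like [sys.executable, "train.py"] (a placeholder script name) would run one copy of the script on each instance.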

Hi ptrblck,
The command is:

CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python

I can run any .py file on the first MIG instance with CUDA_VISIBLE_DEVICES=$UUID python.
However, the same command fails on the second instance.

I also tried setting CUDA_VISIBLE_DEVICES=$UUID in the .py file before import torch, but that doesn’t work either.
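For reference, setting the variable inside the script is usually written roughly as below. The variable has to be in the environment before CUDA is initialized, and setting it before import torch is the conservative way to guarantee that. The helper name select_mig_instance is hypothetical. This only illustrates the placement and does not by itself address the error reported here.

```python
import os
import sys


def select_mig_instance(uuid):
    """Set CUDA_VISIBLE_DEVICES for this process.

    Returns False if torch had already been imported at call time,
    in which case the setting may come too late to take effect.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = uuid
    return not torch_already_imported


# Intended usage at the very top of the script:
#   select_mig_instance("MIG-c170d227-b149-5a1e-85b4-706e305dff4b")
#   import torch
```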


The posted commands don’t show how you’ve created the MIG setup and how to reproduce the issue, so could you post a minimal, executable code snippet which reproduces the issue, please?

I found the cause: MPS.
MPS was running on top of the whole device instead of on each instance.
In that case, MPS places data and launches kernels on the first instance only, since it is not aware of the other instances.

After I shut down MPS, it works for me.
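For anyone hitting the same error, a quick way to check for this condition is to see whether the MPS control daemon is running. The sketch below assumes a Linux system where ps -eo args is available and where the daemon shows up under its usual name, nvidia-cuda-mps-control. The function names are illustrative. stop_mps sends quit to the control daemon, which is the same as running echo quit | nvidia-cuda-mps-control, and it may need the same user or privileges that started the daemon.

```python
import subprocess

MPS_CONTROL = "nvidia-cuda-mps-control"


def mps_daemon_listed(ps_output):
    """True if any command line in the given process listing names the MPS daemon."""
    return any(MPS_CONTROL in line for line in ps_output.splitlines())


def mps_daemon_running():
    """Check the live process table (assumes a procps-style ps on Linux)."""
    ps = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return mps_daemon_listed(ps.stdout)


def stop_mps():
    """Ask the MPS control daemon to shut down by sending it 'quit' on stdin."""
    subprocess.run([MPS_CONTROL], input="quit\n", text=True, check=True)
```

A typical check would be: if mps_daemon_running() returns True, call stop_mps() and then retry the CUDA_VISIBLE_DEVICES command on the second instance.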