Net.cuda() cannot access MIG instances

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-744fbd6b-59e0-0daa-223e-19ac51400ee0)
  MIG 4g.20gb     Device  0: (UUID: MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5)
  MIG 2g.10gb     Device  1: (UUID: MIG-c170d227-b149-5a1e-85b4-706e305dff4b)

To run on MIG 4g.20gb:

CUDA_VISIBLE_DEVICES=MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5 python vittrain.py

To run on MIG 2g.10gb:

CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python vittrain.py

However, the second command fails with RuntimeError: No CUDA GPUs are available,

and torch.cuda.device_count() returns 1, although I would expect 2.
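
For reference, a minimal probe along these lines (a sketch; check_mig.py is just a placeholder name, run with CUDA_VISIBLE_DEVICES set to a single MIG UUID as above) shows what the process actually sees:

# check_mig.py -- sketch: inspect the CUDA devices visible to this process
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))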

Is this a bug?

Best
Max

You won’t be able to use multiple MIG devices in a single script, so that’s expected.
Could you post the exact MIG commands you’ve used which would reproduce that the second device cannot be used?

Hi ptrblck,
Command is:

CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python vgginfer.py

Every .py file runs fine with CUDA_VISIBLE_DEVICES=$UUID python model.py on the first MIG instance.
However, I cannot run it on the second instance.

I also tried setting CUDA_VISIBLE_DEVICES=$UUID before import torch inside the .py file, but that doesn't work either.
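
Roughly, the in-script attempt looks like this (a sketch; the UUID is the second MIG instance from the nvidia-smi -L output above, set before the first import torch):

import os
# Must be set before torch initializes CUDA (safest: before import torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-c170d227-b149-5a1e-85b4-706e305dff4b"

import torch
print(torch.cuda.device_count())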

Best
Max

The posted commands don’t show how you’ve created the MIG setup and how to reproduce the issue, so could you post a minimal, executable code snippet which reproduces the issue, please?

I have found the reason: MPS.
MPS was running on top of the whole device instead of on each MIG instance.
In this case, MPS places data and launches kernels on the first instance only, since it is unaware of the other instances.

After I shut down MPS, it works for me.
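
For reference, the MPS control daemon can be shut down by sending quit to nvidia-cuda-mps-control; a small sketch (assuming the daemon was started with nvidia-cuda-mps-control -d; the plain shell equivalent is echo quit | nvidia-cuda-mps-control):

import subprocess
# Ask the MPS control daemon to quit; equivalent to: echo quit | nvidia-cuda-mps-control
subprocess.run(["nvidia-cuda-mps-control"], input="quit\n", text=True, check=True)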

Best
Max