GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-744fbd6b-59e0-0daa-223e-19ac51400ee0)
MIG 4g.20gb Device 0: (UUID: MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5)
MIG 2g.10gb Device 1: (UUID: MIG-c170d227-b149-5a1e-85b4-706e305dff4b)
To run on MIG 4g.20gb:
CUDA_VISIBLE_DEVICES=MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5 python vittrain.py
To run on MIG 2g.10gb:
CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python vittrain.py
However, the second command fails with the following error:
RuntimeError: No CUDA GPUs are available
torch.cuda.device_count() returns 1, but I expected 2.
Is this a bug?
You won’t be able to use multiple MIG devices in a single script, so that’s expected.
Could you post the exact MIG commands you’ve used that reproduce the failure on the second device?
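Since a single process only sees one MIG device, one way to keep both instances busy is to start one process per instance, each with its own CUDA_VISIBLE_DEVICES. Below is a minimal sketch of that idea, assuming the two UUIDs listed above and a script such as vittrain.py. The helper names build_env and launch_per_instance are made up for illustration and are not part of PyTorch or CUDA.

```python
import os
import subprocess
import sys


def build_env(mig_uuid, base_env=None):
    """Copy the environment and pin the process to a single MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_per_instance(mig_uuids, script_args):
    """Start one Python process per MIG UUID and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, *script_args], env=build_env(uuid))
        for uuid in mig_uuids
    ]
    # Wait for every process; each one sees exactly one CUDA device.
    return [proc.wait() for proc in procs]


# Example call (assumes vittrain.py exists in the current directory):
# launch_per_instance(
#     ["MIG-a917c543-57fe-51ea-8bca-6ec8c68518c5",
#      "MIG-c170d227-b149-5a1e-85b4-706e305dff4b"],
#     ["vittrain.py"],
# )
```

Inside each child process, torch.cuda.device_count() should then report 1, which matches the behaviour described above.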
CUDA_VISIBLE_DEVICES=MIG-c170d227-b149-5a1e-85b4-706e305dff4b python vgginfer.py
I can run every .py file with
CUDA_VISIBLE_DEVICES=$UUID python model.py on the first MIG instance.
However, I cannot run it on the second instance.
I also tried adding
import torch to the .py file, but that doesn’t work either.
The posted commands don’t show how you’ve created the MIG setup and how to reproduce the issue, so could you post a minimal, executable code snippet which reproduces the issue, please?
I have found the reason: MPS.
MPS was running on top of the whole device instead of on each instance.
In this case, MPS places data and launches kernels on the first instance only, since it has no idea about other instances.
After I shut down MPS, it works for me.
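For anyone hitting the same error, a quick check for a running MPS daemon may save some time. The sketch below scans the process list for the MPS control daemon (nvidia-cuda-mps-control) and the MPS server (nvidia-cuda-mps-server). The function names are hypothetical, and it assumes a system where `ps -eo pid,args` is available.

```python
import subprocess

# Executable names of the MPS control daemon and the MPS server.
MPS_PROCESS_NAMES = ("nvidia-cuda-mps-control", "nvidia-cuda-mps-server")


def find_mps_processes(ps_output):
    """Return (pid, command) pairs for MPS processes in `ps -eo pid,args` output."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header line and anything that does not start with a PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = parts
        # Compare only the executable's base name, ignoring its directory.
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable in MPS_PROCESS_NAMES:
            found.append((int(pid), command))
    return found


def mps_is_running():
    """Check the live process list for an MPS daemon or server."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return bool(find_mps_processes(result.stdout))
```

If mps_is_running() returns True, the control daemon can usually be stopped with `echo quit | nvidia-cuda-mps-control`, after which the second instance can be tried again.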