Hi,
I have 4x A100 GPUs on a single machine.
With PyTorch 2.0.0+cu117 and NVIDIA driver version 515.105.01 installed, the following works as expected:
>>> torch.cuda.is_available()
True
However, when I tried to run a simple torch.zeros([1]).cuda() call, it kept throwing a RuntimeError at the CUDA init stage:
File "~/.conda/envs/torchtest/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
When I set export CUDA_VISIBLE_DEVICES=0, the above .cuda() call worked, but unsetting this env var broke it again.
What in my setup could have gone wrong?
I guess the env variable might be set to an invalid value since directly exporting it seems to work.
If that’s not the case, do you have multiple GPUs installed where some are dead or inactive? We recently had a similar issue in this forum where 2 GPUs were installed in the system, but one of them was not plugged in, which caused issues during initialization.
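To rule out the first possibility, it can help to look at what the process actually sees in CUDA_VISIBLE_DEVICES. The snippet below is only a rough sketch: describe_visible_devices is a hypothetical helper name, not a PyTorch API, and it only classifies the variable as unset, empty, or set without checking that the listed devices exist.

```python
import os


def describe_visible_devices(env=None):
    """Classify CUDA_VISIBLE_DEVICES as 'unset', 'empty', or 'set'.

    Hypothetical helper for illustration only. `env` defaults to os.environ.
    Returns (state, entries), where entries are the comma-separated values.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Variable not defined at all: nothing is being masked by the user.
        return ("unset", [])
    entries = [e.strip() for e in value.split(",") if e.strip()]
    if not entries:
        # Defined but blank: this hides every device from CUDA.
        return ("empty", [])
    return ("set", entries)


if __name__ == "__main__":
    print(describe_visible_devices())
```

Running this in the same shell or environment that launches the failing script shows whether a stale or blank value is being inherited.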
Thanks @ptrblck.
I think it has to do with the A100 GPU setup - it has MIG devices configured, and my guess now is that those were not configured correctly for CUDA to enumerate all the devices:
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-6a0256bd-cc1d-7e2a-d8ba-b6d5deefb5ff)
  MIG 7g.80gb Device 0: (UUID: MIG-f659fe52-a79b-5941-807c-2366430ee70e)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-bc5e96fd-a778-84ea-83a8-bb24a49c2d91)
  MIG 7g.80gb Device 0: (UUID: MIG-7cc70377-7e53-5450-807d-45f954c77439)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-a556b2d7-6245-0820-54d7-537606e233e6)
  MIG 7g.80gb Device 0: (UUID: MIG-20b4d0ee-83cf-5f08-930a-153b788d38cd)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-450d7302-f972-c964-b90a-bcfc396949ab)
  MIG 7g.80gb Device 0: (UUID: MIG-bd9d4fd8-b734-5cff-95e6-2d7dd1e0272f)
Will update once I have more findings and a fix.
Thanks for the update. In that case, disable MIG or specify the MIG UUID in CUDA_VISIBLE_DEVICES.
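For the UUID route, here is a minimal sketch of pulling the MIG UUIDs out of a device listing like the one posted above. It assumes the text has the same format as that listing (e.g. as printed by nvidia-smi -L). The names MIG_UUID_PATTERN, SAMPLE_LISTING, and mig_uuids are made up for this example.

```python
import re

# Matches "(UUID: MIG-...)" entries; "(UUID: GPU-...)" entries are skipped.
MIG_UUID_PATTERN = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")

# First two GPUs from the listing posted earlier in this thread.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-6a0256bd-cc1d-7e2a-d8ba-b6d5deefb5ff)
  MIG 7g.80gb Device 0: (UUID: MIG-f659fe52-a79b-5941-807c-2366430ee70e)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-bc5e96fd-a778-84ea-83a8-bb24a49c2d91)
  MIG 7g.80gb Device 0: (UUID: MIG-7cc70377-7e53-5450-807d-45f954c77439)
"""


def mig_uuids(listing):
    """Return the MIG device UUIDs found in `listing`, in order of appearance."""
    return MIG_UUID_PATTERN.findall(listing)


if __name__ == "__main__":
    uuids = mig_uuids(SAMPLE_LISTING)
    # Print a shell line that pins a process to the first MIG instance.
    print(f"export CUDA_VISIBLE_DEVICES={uuids[0]}")
```

The variable has to be set before the process initializes CUDA, so export it in the shell (or set it at the very top of the script, before the first CUDA call). As far as I know, a single CUDA process enumerates only one MIG instance at a time, so listing several MIG UUIDs is not expected to expose all four devices to one process.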
Yep! Disabling the MIG config worked!