One GPU lost, how to train on another without a reboot?

Hello!

I have 4 GPUs in my machine, and one of them is lost:

$>nvidia-smi
Unable to determine the device handle for GPU 0000:86:00.0: GPU is lost.

But the other ones are still working, e.g.

$>nvidia-smi -i 0

works fine.

For several reasons I don’t want to reboot (other computations are still running).
Is there any way to specify which GPU PyTorch should use?

Right now even torch.cuda.is_available() returns False, and any attempt at x.to(device) raises initialization errors.
Restarting the Python kernel doesn’t help.

What’s the output of:

dmesg | grep GPU

?

[   17.159322] [drm] [nvidia-drm] [GPU ID 0x00001800] Loading driver
[   17.160936] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[   17.162486] [drm] [nvidia-drm] [GPU ID 0x00008600] Loading driver
[   17.164018] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[838654.790542] NVRM: GPU at PCI:0000:86:00: GPU-811a52e6-d27e-3f2d-b972-503473cd15d9
[838654.790551] NVRM: GPU Board Serial Number:
[838654.790556] NVRM: Xid (PCI:0000:86:00): 79, GPU has fallen off the bus.
[838654.790631] NVRM: GPU at 00000000:86:00.0 has fallen off the bus.
[838654.790633] NVRM: GPU is on Board .
[838654.790657] NVRM: A GPU crash dump has been created. If possible, please run

Well, I don’t think there is an easy fix without rebooting this time.
But to avoid this problem in the future, you should enable persistence mode:

/usr/bin/nvidia-smi -pm 1

You can find complete info about it here: https://www.cyberciti.biz/faq/debian-ubuntu-rhel-fedora-linux-nvidia-nvrm-gpu-fallen-off-bus/
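
If you prefer to do this from Python instead of the shell, a rough sketch could look like the following; it just wraps the same nvidia-smi command per GPU, and assumes nvidia-smi is on your PATH, that you have the required permissions (usually root), and that there are 4 GPUs as in this thread:

import subprocess

# Enable persistence mode on each GPU separately, so that one lost card
# does not prevent the others from being configured.
for idx in range(4):  # 4 GPUs in this thread's setup
    result = subprocess.run(
        ["nvidia-smi", "-i", str(idx), "-pm", "1"],
        capture_output=True, text=True,
    )
    status = "persistence mode enabled" if result.returncode == 0 else "failed"
    print(f"GPU {idx}: {status}")
    if result.returncode != 0:
        print(result.stderr.strip())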

I already have them in persistence mode.
Also, if you know, what’s the underlying mechanism? Why should it help?

You can specify which GPU to use by setting the CUDA_VISIBLE_DEVICES environment variable. This tells PyTorch which GPUs it is allowed to see.
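
For example, a minimal sketch (the "0" here is just an example of a healthy physical GPU; the variable has to be set before the first CUDA call, so it is safest to set it before importing torch, or in the shell before starting Python):

import os

# Expose only the healthy GPU(s) to CUDA; this must happen before CUDA is
# initialized, so set it before importing torch to be safe.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.is_available())      # True if the visible GPU is usable
print(torch.cuda.device_count())      # number of GPUs PyTorch can see
x = torch.zeros(10, 10).to("cuda:0")  # cuda:0 = first *visible* GPU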

Many thanks!
I had found this solution and thought it didn’t work, but a more careful check shows that it does work, with one caveat.
It doesn’t see the card that comes after the dropped one (3 is invisible, 2 is dropped, 0 and 1 are OK).

test_gpu.py:

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.zeros(10, 10).to(torch.device('cuda:2')))

$>export CUDA_VISIBLE_DEVICES=0,1,3
$>python test_gpu.py
True
2
Traceback (most recent call last):
  File "test_gpu.py", line 4, in <module>
    print(torch.zeros(10, 10).to(torch.device('cuda:2')))
RuntimeError: CUDA error: invalid device ordinal

Do you have any idea how to fix it?

This is happening because you’re not passing the correct GPU id. If you’re not using Docker, run nvidia-smi to see the GPU ids and then specify which GPU you want, like this:

CUDA_VISIBLE_DEVICES=0 python3 main.py

FYI, if you are specifying multiple GPU ids, you need to ensure that your model has multi-GPU support via nn.DataParallel; if not, consider using only one.
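
One detail worth spelling out: once CUDA_VISIBLE_DEVICES is set, the remaining GPUs are renumbered from 0, so the third visible card is addressed as cuda:2 regardless of its physical index. A small sketch to inspect the mapping (this assumes the driver can still enumerate every listed device, which apparently is not the case once a card has fallen off the bus):

import os

# With CUDA_VISIBLE_DEVICES=0,1,3, the physical GPUs 0, 1 and 3 should be
# exposed to PyTorch as cuda:0, cuda:1 and cuda:2 respectively.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,3")

import torch

for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")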

But I am moving the tensor to a specific device.
Moreover, if I make only one GPU visible, everything is OK with 0 or 1, but it doesn’t work for 3 (2 is the dropped one).

And yes, I’ve checked the device ids, and I was using code like this before the GPU failed. So I think the ids are fine.
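
For reference, this is roughly the kind of per-GPU check I mean, as a minimal sketch: each card is probed from a fresh process with only that card visible, and the small allocation is just a smoke test.

import os
import subprocess
import sys

# Probe each physical GPU from a separate Python process with only that
# card visible; CUDA is initialized per process, so a healthy card passes.
probe = "import torch; torch.zeros(10, 10).to('cuda:0'); print('ok')"

for idx in range(4):  # 4 physical GPUs in my box
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx))
    result = subprocess.run([sys.executable, "-c", probe],
                            capture_output=True, text=True, env=env)
    status = result.stdout.strip() if result.returncode == 0 else "failed"
    print(f"physical GPU {idx}: {status}")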

Hi, I am facing the same issue. Did you find the root cause of the error?
Only my 2nd GPU is lost, and I want to continue training on the others.
Yesterday I rebooted, but today the error is back, so any input about what your cause turned out to be would be great. Thanks!

Was it a hardware problem like here?