RuntimeError: CUDA unknown error - Setting available devices to be zero

AlphaBetaGamma96 · May 24, 2021, 11:02pm

Hi All,

I’m currently submitting a few scripts to a remote server with a few different GPUs, and I’ve noticed that some jobs fail with this CUDA unknown error. The GPUs involved are specified to use CUDA 11.0 and the PyTorch installation is 1.7.1+CUDA11. The error has emerged in a few different ways, but only when I call CUDA in some way. For example, moving my model from CPU to GPU will (sometimes) result in the same error. The driver version is 450.66.

Traceback (most recent call last):
  File "~/run.py", line 28, in <module>
    device_name = torch.cuda.get_device_name(torch.cuda.current_device())
  File "~/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 366, in current_device
    _lazy_init()
  File "~/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Is there any particular way to diagnose the error so that I can resolve it? Thank you!
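One way to start narrowing this down is to record what each job sees before it touches CUDA. The following is a minimal sketch, not an official diagnostic: the function name `cuda_environment_report` is hypothetical, and it assumes `nvidia-smi` is on the PATH of the compute node. It only reads state and catches errors instead of raising, so it can be run at the top of a job script.

```python
import os
import shutil
import subprocess


def cuda_environment_report():
    """Collect basic facts about the CUDA setup without raising (hypothetical helper)."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    if report["nvidia_smi_found"]:
        # nvidia-smi talks to the driver directly, independently of PyTorch.
        try:
            proc = subprocess.run(
                ["nvidia-smi", "-L"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=30,
            )
            report["nvidia_smi_output"] = proc.stdout.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_output"] = "failed: {}".format(exc)
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    try:
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    except RuntimeError as exc:
        # Keep the message so failing and working nodes can be compared.
        report["cuda_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print("{}: {}".format(key, value))
```

Comparing this output between a job that failed and one that succeeded (same node or different node, same or different GPU model) may show whether the failures cluster on particular machines.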

ptrblck · May 25, 2021, 5:00am

This sounds more like a setup issue than a PyTorch error, so I would recommend checking the setup with the server admin and also taking a look at dmesg (in particular, search for any Xid entries, which could provide more information about why CUDA is failing).

AlphaBetaGamma96 · May 25, 2021, 9:49am

Do you have any references for further reading on this? I’ve never used dmesg or Xid entries to debug a CUDA install before! Thank you once again!

ptrblck · May 25, 2021, 4:27pm

This document gives you more information about these error codes.

AlphaBetaGamma96 · May 26, 2021, 1:07pm

Hello again! I’ve been briefly reading through the document you sent me and I’m not 100% sure how to proceed with solving this error. The document you shared states that the Xid entries are located at /var/log/messages/; however, that directory does not exist. Is there some other command that needs to be run beforehand?

Also, I ran this command dmesg | grep -e 'NVRM: Xid' to see if any Xid entries appear from dmesg and it returns nothing. The only error that appears is a series of nfs: RPC call returned error 13 errors, but that is all. Could this be a potential issue?
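The same search can be scripted so that it also covers log files other than /var/log/messages. This is a rough sketch under a few assumptions: the candidate paths listed are common defaults on some distributions and may not exist here, reading them (or running dmesg) may require elevated permissions, and the helper names are made up for illustration.

```python
import subprocess

# Assumed candidate locations; which of these exist depends on the distribution.
CANDIDATE_LOGS = ["/var/log/messages", "/var/log/syslog", "/var/log/kern.log"]


def find_xid_lines(log_text):
    """Return every line that contains the 'NVRM: Xid' marker."""
    return [line for line in log_text.splitlines() if "NVRM: Xid" in line]


def read_kernel_messages():
    """Gather kernel messages from dmesg and any readable candidate log file."""
    chunks = []
    try:
        proc = subprocess.run(
            ["dmesg"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=30,
        )
        chunks.append(proc.stdout)
    except (OSError, subprocess.SubprocessError):
        pass  # dmesg missing or not permitted for this user
    for path in CANDIDATE_LOGS:
        try:
            with open(path, errors="replace") as handle:
                chunks.append(handle.read())
        except OSError:
            pass  # file absent or unreadable
    return "\n".join(chunks)


if __name__ == "__main__":
    hits = find_xid_lines(read_kernel_messages())
    print("{} Xid line(s) found".format(len(hits)))
    for line in hits:
        print(line)
```

An empty result only means no Xid was recorded in the sources this user can read. The kernel ring buffer is also limited in size, so older entries may already have been overwritten.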

Also, also, could the type of card be an issue as well? Some cards are GTX 745 cards whereas some are more modern Quadro cards.

Thank you for all the help!

ptrblck · May 26, 2021, 6:40pm

Yes, this could play a role in the issue, but it’s hard to tell without a proper error message (unknown error is unfortunately not very helpful), and apparently you are unable to see any Xids.

OoOqn · May 27, 2021, 12:04am

Hi. You could try rebooting the remote PC. The problem may arise because the GPU drivers were updated recently without a reboot.
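Before rebooting, it may be worth checking whether the loaded kernel module and the user-space tools still agree. The sketch below is only an assumption-laden illustration: it assumes the loaded module reports its version in /proc/driver/nvidia/version on a line starting with "NVRM version", and that `nvidia-smi` is installed. The helper names are hypothetical.

```python
import re
import shutil
import subprocess

PROC_VERSION_FILE = "/proc/driver/nvidia/version"


def parse_kernel_module_version(proc_text):
    """Pull a version such as '450.66' out of the 'NVRM version' line, if present."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version"):
            match = re.search(r"(\d+\.\d+(?:\.\d+)?)", line)
            return match.group(1) if match else None
    return None


def driver_status():
    """Report the loaded module version and what nvidia-smi says, without raising."""
    status = {"kernel_module": None, "nvidia_smi": None}
    try:
        with open(PROC_VERSION_FILE) as handle:
            status["kernel_module"] = parse_kernel_module_version(handle.read())
    except OSError:
        pass  # module not loaded or file not readable
    if shutil.which("nvidia-smi"):
        try:
            proc = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=30,
            )
            # A non-zero return code is itself a hint that the driver stack is unhealthy.
            status["nvidia_smi"] = (proc.returncode, proc.stdout.strip())
        except (OSError, subprocess.SubprocessError) as exc:
            status["nvidia_smi"] = (None, str(exc))
    return status


if __name__ == "__main__":
    print(driver_status())
```

If `nvidia-smi` exits with an error, or reports a version that differs from the loaded module, that would be consistent with a driver update that has not been followed by a reboot.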

AlphaBetaGamma96 · May 27, 2021, 12:54pm

It does seem like a bit of a problem! Is there anything else that comes to mind or am I out of luck?

Also, I was wondering if I could ask another question with some errors I get? For some reason I seem to get an issue with loading my model (occasionally).

Traceback (most recent call last):
  File "~/main.py", line 145, in <module>
    state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Traceback (most recent call last):
  File "~/main.py", line 145, in <module>
    state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 833, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/67511648: file read failed

For the first error it seems that the file is just 0 MB in size, is that correct? I only say this from reading this thread on Stack Overflow here. For the second one, I’m not 100% sure what’s wrong. I did read your previous answer here, but I’m saving everything within a dictionary rather than saving the model directly, like this…

torch.save({'epoch':preepoch,
            'model_state_dict':net.state_dict(),
            'optim_state_dict':optim.state_dict(),
            'loss':mean_preloss,
            'chains':sampler.chains}, model_path_pt)

and then loaded with

state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
start=state_dict['epoch']+1
net.load_state_dict(state_dict['model_state_dict'])
optim.load_state_dict(state_dict['optim_state_dict'])
loss = state_dict['loss']
sampler.chains = state_dict['chains']

Thank you!
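A cheap check before calling torch.load can separate "the file is empty or truncated" from other causes. This is a sketch that assumes the checkpoint was written in the zip-based format (the default for torch.save since PyTorch 1.6) and that Python's standard zipfile module can read it. The function name `checkpoint_problem` is hypothetical, and a file in the legacy format would be reported as "not a zip archive" even if it is fine.

```python
import os
import zipfile


def checkpoint_problem(path):
    """Return a short description of an obvious problem with the file, or None."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty file"
    if not zipfile.is_zipfile(path):
        return "not a zip archive (legacy format or truncated file)"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first one with a bad checksum.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, OSError) as exc:
        return "unreadable archive: {}".format(exc)
    if bad_member is not None:
        return "corrupt member: {}".format(bad_member)
    return None
```

For example, `checkpoint_problem(model_path_pt)` returning "empty file" would be consistent with the EOFError above, since an empty file is not a zip archive and the loader then falls back to the legacy pickle path.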

Edit: A follow-up question on the PytorchStreamReader error: I save my model each epoch, and each epoch takes around 0.3s. Is it advisable to save at every epoch or only every n-th epoch? Could rewriting the file every 0.3s be causing the issue when it is read? I ask because the error varies a bit: sometimes it’s failed finding central directory, sometimes invalid header or archive is corrupted, or file read failed!
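If the file is sometimes read (or the job killed) while a save is still in progress, one common mitigation is to write to a temporary file and rename it over the real path once the write has finished, and to save less often. The sketch below illustrates that idea under stated assumptions: `atomic_save`, `pickle_save` and `should_save` are hypothetical helpers, the cause of the corruption here is not confirmed, and rename semantics on a network filesystem such as NFS may differ from a local disk.

```python
import os
import pickle
import tempfile


def atomic_save(obj, path, save_fn):
    """Write obj with save_fn(obj, tmp_path), then rename the result over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def pickle_save(obj, file_path):
    """Stand-in save function used so the sketch runs without PyTorch."""
    with open(file_path, "wb") as handle:
        pickle.dump(obj, handle)


def should_save(epoch, every_n):
    """Save only on every n-th epoch."""
    return epoch % every_n == 0
```

With PyTorch, the call would look like `atomic_save(checkpoint_dict, model_path_pt, torch.save)`, since torch.save accepts an object followed by a file path. Wrapping it in `if should_save(epoch, 100):` (the interval is an arbitrary example) reduces how often the file is rewritten.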

AlphaBetaGamma96 · May 27, 2021, 12:55pm

Thank you, @OoOqn! I tried that but not much changed!

ptrblck · May 27, 2021, 7:57pm

You could create a topic in the NVIDIA board following these steps to provide a full log, which might help to isolate the issue further.