RuntimeError: CUDA unknown error - Setting available devices to be zero

It does seem like a bit of a problem! Is there anything else that comes to mind or am I out of luck? :stuck_out_tongue:

Also, could I ask another question about some errors I get? Occasionally, loading my model fails with one of the two tracebacks below.

Traceback (most recent call last):
  File "~/main.py", line 145, in <module>
    state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Traceback (most recent call last):
  File "~/main.py", line 145, in <module>
    state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "~/.local/lib/python3.6/site-packages/torch/serialization.py", line 833, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/67511648: file read failed

For the first one, it seems that the file is just 0 MB in size, is that correct? I only say this from reading this thread on Stack Overflow here. For the second one, I’m not 100% sure what’s wrong. I did read your previous answer here, but I’m saving everything within a dictionary rather than saving the model directly, like this…

torch.save({'epoch':preepoch,
            'model_state_dict':net.state_dict(),
            'optim_state_dict':optim.state_dict(),
            'loss':mean_preloss,
            'chains':sampler.chains}, model_path_pt)

and then loaded with

state_dict = torch.load(f=model_path_pt, map_location=torch.device(device))
start=state_dict['epoch']+1
net.load_state_dict(state_dict['model_state_dict'])
optim.load_state_dict(state_dict['optim_state_dict'])
loss = state_dict['loss']
sampler.chains = state_dict['chains']
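
As a rough sanity check before loading, something like the sketch below could at least tell the empty-file case apart from the others. It only uses the standard library; `checkpoint_status` is a made-up helper name, and it assumes the checkpoint was written in the zip-based format that the second traceback goes through. It does not prove the file will actually load.

```python
import os
import zipfile


def checkpoint_status(path):
    """Cheap pre-load check (hypothetical helper); not a guarantee of loadability."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # A 0-byte file is consistent with "EOFError: Ran out of input".
        return "empty"
    if zipfile.is_zipfile(path):
        # The zip end-of-central-directory record was found.
        return "zip"
    # Either an older non-zip checkpoint or a truncated/corrupt zip archive.
    return "not_zip"
```

Calling `checkpoint_status(model_path_pt)` right before `torch.load` and printing the result next to the error would show whether the failures line up with empty or truncated files.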

Thank you!

Edit: A follow-up question on the PytorchStreamReader error: I save my model every epoch, and each epoch takes around 0.3s. Is it advisable to save at every epoch, or only at every n-th epoch? Could saving this often be what causes the issue with reading the file? I ask because the error varies a bit: sometimes it’s failed finding central directory, invalid header or archive is corrupted, or file read failed!
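
If the cause really is a checkpoint being read while it is only half written (that is a guess on my part, not something confirmed here), one common way to rule it out is to write to a temporary file in the same directory and then rename it over the real path. A minimal sketch, where `atomic_save` is a hypothetical helper and `save_fn` is any callable taking `(obj, path)`, such as `torch.save`:

```python
import os
import tempfile


def atomic_save(obj, path, save_fn):
    """Write obj via save_fn to a temp file, then rename it over path.

    Hypothetical helper: a reader sees either the old complete file or the
    new complete file, never a partially written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, path)  # replaces the target in a single step
    except BaseException:
        # Clean up the temp file and leave the previous checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the dictionary from above this would be called as `atomic_save(checkpoint_dict, model_path_pt, torch.save)`. Saving only when `epoch % n == 0` for some chosen `n` would also cut down how often the file is rewritten, independently of the rename trick.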