Over the last two days, I have repeatedly run into a CUDA out of memory error when loading a PyTorch model.
I didn't change any code; the error just came out of nowhere.
It was still working an hour ago.
I also checked the GPU with nvidia-smi and it looks fine.
Traceback (most recent call last):
  File "test.py", line 39, in <module>
    top_model.load_state_dict(torch.load('pkls/True_fuck_feature_top_model_basic_loss_loss_10.pkl'))
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
    data_type(size), location)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
    result = fn(storage, location)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 95, in _cuda_deserialize
    return obj.cuda(device)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/_utils.py", line 76, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/cuda/__init__.py", line 496, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
Did you update any libraries since the last successful run?
Did you specify any devices using CUDA_VISIBLE_DEVICES?
Is the GPU completely empty before you run the script?
Hey @ptrblck and @cbox, I am currently running into this issue and am wondering how it was resolved.
Did you update any libraries since the last successful run?
I am using the same conda environment, so no change in libraries.
Did you specify any devices using CUDA_VISIBLE_DEVICES?
I am just specifying the device via: device = torch.device('cuda:4')
I am still pretty green here, so I am not really sure what the difference is; however, this is the first time I have run into an out-of-memory situation with one of these particular models (in fact, a different GPU in the cluster is currently training the exact same model).
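To illustrate, here is a rough sketch of the two ways of picking a GPU as I understand them; index 4 is just the one from my setup:

```python
import torch

# What I am doing: keep every GPU visible to the process and pick one by index.
device = torch.device('cuda:4')

# The CUDA_VISIBLE_DEVICES alternative is set outside the script, e.g.:
#   CUDA_VISIBLE_DEVICES=4 python test.py
# Only physical GPU 4 is then visible to the process, and it is remapped to
# index 0, so inside the script the device would simply be:
# device = torch.device('cuda:0')
```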
Is the GPU completely empty before you run the script?
I also faced this problem today and solved it by loading the checkpoint on the CPU first.
Thinking about it later, the reason is probably that the model was trained and saved on my GPU 0, while I tried to load it using my GPU 1. At the same time, GPU 0 was busy with something else and had no memory left; since torch.load by default restores tensors to the device they were saved from, the load attempt ran GPU 0 out of memory. I guess that is why loading the model on the CPU first and then sending it to GPU 1 fixed the problem.
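For anyone hitting the same thing, a minimal sketch of that workaround (the checkpoint path and model here are placeholders, not the ones from this thread):

```python
import torch
import torch.nn as nn

# Placeholder model; the real top_model architecture is not shown in this thread.
top_model = nn.Linear(128, 10)

# Load the checkpoint onto the CPU so that deserialization does not try to
# allocate memory on the (already full) GPU the tensors were saved from.
state_dict = torch.load('pkls/top_model.pkl', map_location='cpu')
top_model.load_state_dict(state_dict)

# Only now move the parameters to the GPU that actually has free memory.
top_model.to(torch.device('cuda:1'))
```

map_location can also point straight at the target device, e.g. torch.load(path, map_location='cuda:1'), which restores the tensors directly onto that GPU instead of going through the CPU first.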