Over the last two days, I have repeatedly run into a CUDA out of memory error when loading a PyTorch model.
I didn't change any code; the error just came out of nowhere.
It was still working an hour ago.
I also checked the GPU with nvidia-smi and it looks fine.
Traceback (most recent call last):
  File "test.py", line 39, in <module>
    top_model.load_state_dict(torch.load('pkls/True_fuck_feature_top_model_basic_loss_loss_10.pkl'))
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
    data_type(size), location)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
    result = fn(storage, location)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/serialization.py", line 95, in _cuda_deserialize
    return obj.cuda(device)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/_utils.py", line 76, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/home1/cbx/anaconda3/envs/torch_1.0/lib/python3.7/site-packages/torch/cuda/__init__.py", line 496, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
Did you update any libraries since the last successful run?
Did you specify any devices using CUDA_VISIBLE_DEVICES?
Is the GPU completely empty before you run the script?
Hey @ptrblck and @cbox, I am currently running into this issue and am wondering how it was resolved.
Did you update any libraries since the last successful run?
I am using the same conda environment, so no change in libraries.
Did you specify any devices using CUDA_VISIBLE_DEVICES?
I am just specifying the device via: device = torch.device('cuda:4')
I am still pretty green here, so I am not really sure what the difference is; however, this is the first time I have run into an out-of-memory situation with one of these particular models (in fact, a different GPU in the cluster is currently training the exact same model).
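To illustrate, here is a rough sketch of the two ways of picking a GPU as I understand them; index 4 is just the one from my setup:

```python
import torch

# What I am doing: keep every GPU visible to the process and pick one by index.
device = torch.device('cuda:4')

# The CUDA_VISIBLE_DEVICES alternative is set outside the script, e.g.:
#   CUDA_VISIBLE_DEVICES=4 python test.py
# Only physical GPU 4 is then visible to the process, and it is remapped to
# index 0, so inside the script the device would simply be:
# device = torch.device('cuda:0')
```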
Is the GPU completely empty before you run the script?
I also faced this problem today and solved it by loading the checkpoint on the CPU first.
Thinking about it later, the reason is probably that the model was trained and saved on my GPU 0, while I tried to load it using my GPU 1. At the same time, GPU 0 was busy with something else and had no memory left; since torch.load by default restores tensors to the device they were saved from, the load attempt ran GPU 0 out of memory. I guess that is why loading the model on the CPU first and then sending it to GPU 1 fixed the problem.
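For anyone hitting the same thing, a minimal sketch of that workaround (the checkpoint path and model here are placeholders, not the ones from this thread):

```python
import torch
import torch.nn as nn

# Placeholder model; the real top_model architecture is not shown in this thread.
top_model = nn.Linear(128, 10)

# Load the checkpoint onto the CPU so that deserialization does not try to
# allocate memory on the (already full) GPU the tensors were saved from.
state_dict = torch.load('pkls/top_model.pkl', map_location='cpu')
top_model.load_state_dict(state_dict)

# Only now move the parameters to the GPU that actually has free memory.
top_model.to(torch.device('cuda:1'))
```

map_location can also point straight at the target device, e.g. torch.load(path, map_location='cuda:1'), which restores the tensors directly onto that GPU instead of going through the CPU first.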