Out of memory error when resuming training even though my GPU is empty

I am training a classification model and have saved some checkpoints. When I try to resume training, however, I get an out-of-memory error:

Traceback (most recent call last):
  File "train.py", line 283, in <module>
    main()
  File "train.py", line 86, in main
    optimizer.load_state_dict(checkpoint['optimizer'])
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/optim/optimizer.py", line 96, in load_state_dict
    state_dict = deepcopy(state_dict)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 23, in __deepcopy__
    new_storage = self.storage().__deepcopy__(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 28, in __deepcopy__
    new_storage = self.clone()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 42, in clone
    return type(self)(self.size()).copy_(self)
RuntimeError: CUDA error: out of memory

The error is in optimizer.load_state_dict(). The code to load the checkpoint is like this:

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_acc = checkpoint['best_acc']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})".format(
            args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))

I checked the target GPU, and it is actually empty. I am currently using PyTorch version 0.4.1.

Could you try to load the checkpoint onto the CPU first using the map_location argument of torch.load?
Once that succeeds, try to push your model onto the GPU again.
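
A minimal sketch of that suggestion, in case it helps. The `build` helper, the tiny `nn.Linear` model, the SGD optimizer, and the temporary checkpoint file are all hypothetical stand-ins so the snippet runs on its own; only the `map_location='cpu'` load and the `state_dict` / `optimizer` / `epoch` keys mirror the code in the question. It falls back to the CPU when no GPU is available.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build(device):
    """Hypothetical stand-in for the real model/optimizer construction."""
    model = nn.Linear(4, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Write a checkpoint to resume from; one step so the optimizer has state to save.
model, optimizer = build(device)
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(
    {
        "epoch": 1,
        "state_dict": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    },
    path,
)

# Resume: deserialize onto the CPU so the load itself needs no GPU memory.
checkpoint = torch.load(path, map_location="cpu")
model, optimizer = build(device)  # the model is already on the target device
model.load_state_dict(checkpoint["state_dict"])
# load_state_dict casts the optimizer state to the device of each parameter.
optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"]
del checkpoint  # release the CPU copy once everything has been loaded

momentum_buffer = optimizer.state[next(model.parameters())]["momentum_buffer"]
print(start_epoch, momentum_buffer.device)
```

The order is deliberate: the model sits on its final device before the optimizer state is loaded, so the restored optimizer buffers end up on the same device as the parameters they belong to.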

Thanks. It solved the problem!
checkpoint = torch.load(path, map_location='cpu')

This did not work for me; I am running into the same issue. I can train on one fold, but after I reset the state_dicts for the model, optimizer, and scheduler, I get an OOM error at optimizer.step().

PyTorch version: 1.10.0+cu102

If I understood correctly, you suggested

checkpoint = torch.load(..., map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
model = model.cuda()

This runs out of memory at optimizer.step() after training successfully on one fold.

This would mean that additional tensors are most likely being pushed to or created on the device. Compare the memory usage before resetting the states and after each step of loading the model to isolate where the memory overhead comes from.
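
One rough way to do that comparison is to print the allocator statistics at each stage. This is only a sketch: `gpu_memory_mib`, `report`, and the small `nn.Linear` / Adam setup are hypothetical placeholders for your own training code, while `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` are the actual PyTorch calls. Without a GPU it simply reports zeros.

```python
import torch
import torch.nn as nn


def gpu_memory_mib():
    """Return (allocated, reserved) in MiB for the current CUDA device, or zeros without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


def report(tag, log):
    """Print the current memory usage and record it in `log` for later comparison."""
    allocated, reserved = gpu_memory_mib()
    log.append((tag, allocated, reserved))
    print(f"{tag:<24} allocated={allocated:8.1f} MiB  reserved={reserved:8.1f} MiB")


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log = []

report("before reset", log)

# Hypothetical stand-ins for rebuilding the model and optimizer for a new fold.
model = nn.Linear(256, 256).to(device)
optimizer = torch.optim.Adam(model.parameters())
report("after building model", log)

model(torch.randn(32, 256, device=device)).sum().backward()
report("after backward", log)

# Adam creates its per-parameter state lazily, on the first step.
optimizer.step()
report("after optimizer.step", log)
```

If the "before reset" line already shows a large allocation, something from the previous fold (old model, optimizer, or stored outputs) is probably still referenced. Note that these counters only cover the current process, so memory held by other processes will not show up here.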

@ptrblck I found the issue using nvidia-smi: the AWS SageMaker / Jupyter notebook was leaving orphaned processes on the GPU, which is why there wasn’t enough memory available. I am unsure why there were orphaned processes on the GPU.
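
For anyone else checking for this, here is a rough sketch of listing the processes that hold GPU memory from Python. It shells out to `nvidia-smi` with its `--query-compute-apps` option; the helper names `parse_compute_apps` and `list_gpu_processes` are made up for illustration, and the function returns an empty list if `nvidia-smi` is missing or fails.

```python
import os
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into (pid, name, used_memory) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        if not pid.strip().isdigit():
            continue
        # used_memory is kept as text because nvidia-smi may report a non-numeric value.
        rows.append((int(pid), name.strip(), used.strip()))
    return rows


def list_gpu_processes():
    """Return the compute processes reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


for pid, name, used in list_gpu_processes():
    owner = "this process" if pid == os.getpid() else "other process"
    print(f"{pid:>8}  {used:>8} MiB  {name}  ({owner})")
```

Any entry that is not your current training process and still holds memory is a candidate for the kind of leftover process described above.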
