Out of memory error when resuming training even though my GPU is empty


(jdhao) #1

I am training a classification model and have saved some checkpoints. When I try to resume training, however, I get an out of memory error:

Traceback (most recent call last):
  File "train.py", line 283, in <module>
    main()
  File "train.py", line 86, in main
    optimizer.load_state_dict(checkpoint['optimizer'])
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/optim/optimizer.py", line 96, in load_state_dict
    state_dict = deepcopy(state_dict)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 23, in __deepcopy__
    new_storage = self.storage().__deepcopy__(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 28, in __deepcopy__
    new_storage = self.clone()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 42, in clone
    return type(self)(self.size()).copy_(self)
RuntimeError: CUDA error: out of memory

The error occurs in optimizer.load_state_dict(). The code that loads the checkpoint looks like this:

if args.resume:                                                      
    if os.path.isfile(args.resume):                                  
        print("=> loading checkpoint '{}'".format(args.resume))      
        checkpoint = torch.load(args.resume)                         
        args.start_epoch = checkpoint['epoch']                       
        best_acc = checkpoint['best_acc']                            
        model.load_state_dict(checkpoint['state_dict'])              
        optimizer.load_state_dict(checkpoint['optimizer'])           
        print("=> loaded checkpoint '{}' (epoch {})".format(         
            args.resume, checkpoint['epoch']))                       
    else:                                                            
        print("=> no checkpoint found at '{}'".format(args.resume))  

I checked the target GPU and it is actually empty. I am currently using PyTorch version 0.4.1.


#2

Could you try loading the checkpoint onto the CPU first, using the map_location argument of torch.load?
Once that works, push your model onto the GPU again.
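
For example, the resume block from your post could look roughly like this (an untested sketch; model, optimizer, args and best_acc are the names from your snippet, and device is assumed to be your target GPU):

device = torch.device('cuda')  # or e.g. torch.device('cuda:1') for a specific GPU

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        # map_location='cpu' deserializes every tensor onto the CPU, so the
        # deepcopy inside optimizer.load_state_dict() does not touch GPU memory.
        checkpoint = torch.load(args.resume, map_location='cpu')
        args.start_epoch = checkpoint['epoch']
        best_acc = checkpoint['best_acc']
        model.load_state_dict(checkpoint['state_dict'])
        model.to(device)  # push the model onto the GPU again
        # load_state_dict should cast the optimizer state to the device of the
        # matching parameters, so the state ends up on the GPU as well.
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})".format(
            args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))

This way the only extra memory used while deserializing the checkpoint is host RAM, and the model and optimizer state still end up on the GPU afterwards.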


(Ray Luo) #4

Thanks. It solved the problem!
checkpoint = torch.load(path, map_location='cpu')
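
For reference, map_location also accepts a device string or a remapping dict, in case you want the checkpoint tensors to land directly on a particular GPU instead of going through the CPU (a hypothetical example, assuming cuda:0 is the target device):

checkpoint = torch.load(path, map_location='cuda:0')
# or remap storages that were saved on one GPU onto another:
checkpoint = torch.load(path, map_location={'cuda:1': 'cuda:0'})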