I am training a classification model and have saved some checkpoints. When I try to resume training, however, I get an out-of-memory error:
Traceback (most recent call last):
  File "train.py", line 283, in <module>
    main()
  File "train.py", line 86, in main
    optimizer.load_state_dict(checkpoint['optimizer'])
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/optim/optimizer.py", line 96, in load_state_dict
    state_dict = deepcopy(state_dict)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 23, in __deepcopy__
    new_storage = self.storage().__deepcopy__(memo)
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 28, in __deepcopy__
    new_storage = self.clone()
  File "/home/haojiedong/tools/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 42, in clone
    return type(self)(self.size()).copy_(self)
RuntimeError: CUDA error: out of memory
The error is raised inside optimizer.load_state_dict(). The code that loads the checkpoint looks like this:
if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_acc = checkpoint['best_acc']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})".format(
            args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
I checked the target GPU, and it is actually empty. I am currently using PyTorch version 0.4.1.
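One thing that may be worth trying, offered as a sketch and not a confirmed diagnosis: by default torch.load restores each tensor to the device it was saved from, so the checkpoint's weights and optimizer state can sit on a GPU alongside the live model. The deepcopy inside optimizer.load_state_dict() (visible in the traceback) then needs room for yet another copy of the optimizer state. Passing map_location="cpu" keeps the loaded checkpoint in host memory. The helper name load_checkpoint_on_cpu, the tiny nn.Linear model, and the temporary file path below are hypothetical stand-ins for illustration; the checkpoint keys are the ones used in the snippet above.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_checkpoint_on_cpu(path, model, optimizer):
    """Hypothetical helper: restore training state from a checkpoint staged on the CPU."""
    # map_location="cpu" keeps every tensor from the file in host memory
    # instead of restoring it to the device it was saved from.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    # load_state_dict copies the optimizer state and moves it to the
    # device of the matching parameters, so the model can live on a GPU.
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["epoch"], checkpoint["best_acc"]


# Stand-in model and optimizer (assumptions; the original model is not shown).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so the optimizer holds momentum buffers worth saving.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# Save a checkpoint with the same keys as the loading code in the question.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth.tar")
torch.save(
    {
        "epoch": 3,
        "best_acc": 0.5,
        "state_dict": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    },
    path,
)

# Resume into a fresh model/optimizer pair.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)
start_epoch, best_acc = load_checkpoint_on_cpu(path, new_model, new_optimizer)
```

In the training script this would amount to changing the torch.load call to torch.load(args.resume, map_location="cpu"). If the checkpoint was saved from a different GPU index than the one being used now, the default behaviour could also allocate the memory on that other GPU, which might explain why the target GPU looks empty.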