Hi all,
I am trying to resume training of a pretrained ResNet-50, a 3D CNN initialized with Kinetics weights from https://github.com/kenshohara/3D-ResNets-PyTorch. The initial training (5 epochs) was done on a CUDA device running PyTorch 1.0.0. The model and loss function are moved to the current device before training starts.
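For reference, the device placement before training looks roughly like this (a minimal sketch; model and criterion stand for my actual model and loss, which are constructed elsewhere):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)          # 3D ResNet-50 initialized with Kinetics weights
criterion = criterion.to(device)  # loss function on the same device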
I am unable to resume training on the same CUDA machine, but resuming on a CPU device works fine. The exact error output is:
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
I save the training state as follows:
best_ckpt_path = os.path.join('checkpoint-best.tar')
states = {
    'epoch': epoch + 1,
    'optimizer': optimizer.state_dict(),
    'state_dict': model.state_dict(),
}
torch.save(states, best_ckpt_path)
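In case it is relevant, the optimizer state inside the checkpoint can also be inspected directly from the file; a minimal sketch (not part of my training code, assuming the checkpoint was saved as above):

ckpt = torch.load(best_ckpt_path)  # no map_location, so tensors keep the device they were saved from
for param_id, state in ckpt['optimizer']['state'].items():
    for key, value in state.items():
        if isinstance(value, torch.Tensor):
            print(param_id, key, value.device)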
and I load it like this:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
parameters = self.get_fine_tuning_parameters(opt.param_dict_list,
                                             opt.learning_rate,
                                             opt.weight_decay)
if opt.optim.lower() in ['adam']:
    optimizer = optim.Adam(parameters, lr=opt.learning_rate, weight_decay=opt.weight_decay)
elif opt.optim.lower() in ['sgd']:
    optimizer = optim.SGD(parameters, lr=opt.learning_rate, momentum=0.9, weight_decay=opt.weight_decay)
else:
    raise ValueError('Invalid optimizer type string.')
self.optimizer = optimizer

if os.path.isfile(self.resume_path):
    print('resuming model from checkpoint {}'.format(self.resume_path))
    if self.device.type in ['cpu']:
        checkpoint = torch.load(self.resume_path, map_location=self.device)
    else:
        checkpoint = torch.load(self.resume_path)
    self.model.load_state_dict(checkpoint['state_dict'])
    self.optimizer.load_state_dict(checkpoint['optimizer'])
    for state in self.optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                print("device: {}".format(v.device))
    self.begin_epoch = checkpoint['epoch']
The print output of the block above is cpu for every optimizer tensor, on both the CUDA machine and the CPU machine, which shows that the optimizer-related tensors are already placed on the CPU after loading. So how can I get a "got torch.cuda.FloatTensor" error?
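For completeness, if the optimizer state really were on the wrong device, I would expect the fix to look something like the following, i.e. moving every state tensor after load_state_dict (a sketch, not currently in my code):

for state in self.optimizer.state.values():
    for key, value in state.items():
        if isinstance(value, torch.Tensor):
            # move momentum buffers etc. onto the same device as the model
            state[key] = value.to(self.device)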