"RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor" while resuming training

Hi all,

I am trying to resume training of a pretrained ResNet-50, a 3D CNN initialized with Kinetics weights (https://github.com/kenshohara/3D-ResNets-PyTorch). The initial training (5 epochs) was done on a CUDA device running PyTorch 1.0.0. My model and loss function are sent to the current device before training starts.
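The setup before training looks roughly like this (simplified, not my exact code):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)          # 3D ResNet-50 with Kinetics weights
criterion = criterion.to(device)  # loss function on the same device
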
I am unable to resume training on the same CUDA machine, but I can resume it on a CPU device.
The exact error output is as follows:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor

I save the states as follows:

        best_ckpt_path = os.path.join('checkpoint-best.tar')
        states = {
            'epoch': epoch + 1,
            'optimizer': optimizer.state_dict(),
            'state_dict': model.state_dict(),
        }
        torch.save(states, best_ckpt_path)

and I load them like this:

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        parameters = self.get_fine_tuning_parameters(opt.param_dict_list,
                                                     opt.learning_rate,
                                                     opt.weight_decay)

        if opt.optim.lower() in ['adam']:
            optimizer = optim.Adam(parameters, lr=opt.learning_rate, weight_decay=opt.weight_decay)
        elif opt.optim.lower() in ['sgd']:
            optimizer = optim.SGD(parameters, lr=opt.learning_rate, momentum=0.9, weight_decay=opt.weight_decay)
        else:
            raise ValueError('Invalid optimizer type string.')

        self.optimizer = optimizer

        if os.path.isfile(self.resume_path):
            print('resuming model from checkpoint {}'.format(self.resume_path))

            if self.device.type in ['cpu']:
                checkpoint = torch.load(self.resume_path, map_location=self.device)
            else:
                checkpoint = torch.load(self.resume_path)

            self.model.load_state_dict(checkpoint['state_dict'])
            self.optimizer.load_state_dict(checkpoint['optimizer'])
            for state in self.optimizer.state.values():
                for k, v in state.items():
                    if isinstance(v, torch.Tensor):
                        print("device: {}".format(v.device))

            self.begin_epoch = checkpoint['epoch']

The print output of the above block is cpu for every optimizer tensor, on both the CUDA machine and the CPU machine. So the optimizer-related tensors are clearly placed on the CPU. How can I then get an error saying it got a torch.cuda.FloatTensor?

Hi

Based on your error message, I guess the tensors in the momentum_buffer of the optimizer are on the wrong device.

You can try to run this snippet to move the buffers:

# inspect the state of the first parameter before moving
print(optimizer.state[list(optimizer.state.keys())[0]])

for p in optimizer.state.keys():
    param_state = optimizer.state[p]
    buf = param_state["momentum_buffer"]
    param_state["momentum_buffer"] = buf.cuda()  # move the buffer to the GPU

# inspect it again after the move
print(optimizer.state[list(optimizer.state.keys())[0]])

You can see how the momentum_buffer is used in the SGD optimizer source (torch/optim/sgd.py).
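
If you want to cover arbitrary optimizers and state entries (not just SGD's momentum_buffer), a more general variant of the same idea would be (assuming device is your target device):

# move every tensor held in the optimizer state to the target device
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)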

Hope this helps.


Hi,
thanks a lot for the solution. I was not aware of the momentum buffers and have never handled them explicitly in my previous work. Why do you think this happened? To my knowledge, I only need to push the model and the loss function to the current device. Could you describe the best practice in this case, please?

I’m not sure. If you call torch.load, it should restore the tensors on the device they were on prior to saving.
The optimizer should also cast the tensors to the respective device when you load its state dict.

I can’t really tell without the complete code.
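
One thing you could try is to pass map_location explicitly, so the checkpoint follows the current device no matter where it was saved, e.g.:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# remap every tensor in the checkpoint to the current device
checkpoint = torch.load(resume_path, map_location=device)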

I had a similar issue, and I think it is related to your optimizer referencing model parameters that are not yet on the GPU. In my case I fixed it by changing the following:

optimizer = torch.optim.Adam(model.parameters(), lr=params.lr)
try:
    checkpoint = torch.load("checkpoints/" + model_folder + "checkpoint.pth.tar")
    start_epoch = checkpoint['epoch']
    scheduler.load_state_dict(checkpoint['scheduler'])
    model.load_state_dict(checkpoint['state_dict'])
    model.to(device)  # moving this line here fixed the issue
    optimizer.load_state_dict(checkpoint['optimizer'])
    …
    # it used to be somewhere around here

So in your case a cleaner solution would be to move that parameters object to the GPU (I’m not sure how exactly, but there should be a one-liner for it) BEFORE loading the optimizer's state dict; otherwise the state seems to end up in regular host memory instead of GPU memory.
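
For example, something along these lines (a rough sketch, not tested against your code; resume_path and the hyperparameters are placeholders):

import os
import torch
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)  # move the model BEFORE creating/loading the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

if os.path.isfile(resume_path):
    # map_location makes the checkpoint tensors follow the current device
    checkpoint = torch.load(resume_path, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    begin_epoch = checkpoint['epoch']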