Code that loads SGD fails to load Adam state to GPU

Strange issue. I was using SGD on my model, and saving and loading the optimiser state on GPU worked fine. But after replacing SGD with Adam, it suddenly complains that its internal state requires the CPU (using PyTorch 1.2, CUDA 10):

Traceback (most recent call last):
File “/root/network/”, line 139, in train
File “/opt/conda/lib/python3.6/site-packages/torch/optim/”, line 93, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: expected device cpu and dtype Float but got device cuda:0 and dtype Float

The very same code saves and loads SGD optimiser state without problems. Should I move and load Adam somewhat differently? This is the code that loads the state and trains:

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = MyModel()

  # optimiser init
  optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
  first_epoch = 0

  if load_model is not None:
    checkpoint = torch.load(load_model, map_location=device)
    first_epoch = checkpoint['epoch'] + 1
  for epoch in range(first_epoch, num_epochs):  # num_epochs: assumed name for the total epoch count
     # training here
     model(x, y)
     optimiser.step() # this is where the error happens

This is how I save the state:

  torch.save({
      'epoch': epoch,
      'model_state_dict': model.state_dict(),
      'optimiser_state_dict': optimiser.state_dict(),
      'loss': epoch_loss,
  }, model_file)
  print("Model saved:", model_file)

I see from the error message that it is caused by the internal parameters of Adam, but as I am already loading the checkpoint to the GPU, I cannot understand why.


The internal states might have been stored as CUDATensors, if you’ve pushed the model to the GPU in your previous run.
Does your code work, if you push the model to the GPU again before initializing the optimizer?
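As a rough illustration of why SGD did not show this: Adam keeps per-parameter tensors (`exp_avg`, `exp_avg_sq`) in its state, while SGD without momentum keeps none, so there is nothing that could end up on the wrong device. This is only a sketch and assumes your SGD run used no momentum; the tiny `nn.Linear` model and the helper name `tensor_state_keys` are made up for the example.

```python
import torch
import torch.nn as nn


def tensor_state_keys(optimiser_cls, **kwargs):
    """Run one optimisation step and return the names of all tensor-valued
    entries the optimiser keeps per parameter."""
    model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
    optimiser = optimiser_cls(model.parameters(), **kwargs)
    model(torch.randn(8, 4)).sum().backward()
    optimiser.step()
    keys = set()
    for state in optimiser.state.values():
        # Only count tensors; these are what must live on the parameter's device.
        keys.update(k for k, v in state.items() if torch.is_tensor(v))
    return keys


sgd_keys = tensor_state_keys(torch.optim.SGD, lr=0.1)    # no momentum
adam_keys = tensor_state_keys(torch.optim.Adam, lr=0.1)
print("SGD tensor state:", sgd_keys)
print("Adam tensor state:", adam_keys)
```

With momentum enabled, SGD would also store a `momentum_buffer` per parameter, so the same device question would apply there.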


I think you’re right.
In the torch.optim documentation, this note is mentioned:

If you need to move a model to GPU via .cuda() , please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.


Thank you! Indeed, pushing the model to GPU before loading the state of optimiser solved the problem. Though, I would say that the error message should pop up upon loading the state dict, not on step(). Thanks again @ptrblck and @Upgrade_Yourself!
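For anyone landing here later, the order that worked can be sketched roughly as follows. This is not the original training script: `make_model`, `train_step`, and the in-memory `buffer` (standing in for `model_file`) are hypothetical, and it falls back to CPU when no GPU is available. The checkpoint keys are the ones from the question.

```python
import io

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def make_model():
    # Hypothetical stand-in for MyModel()
    return nn.Linear(4, 2)


def train_step(model, optimiser):
    # One dummy optimisation step on random data
    optimiser.zero_grad()
    x = torch.randn(8, 4, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimiser.step()
    return loss.item()


# First run: train briefly and save a checkpoint.
model = make_model().to(device)
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
train_step(model, optimiser)
buffer = io.BytesIO()  # stands in for model_file
torch.save({
    'epoch': 0,
    'model_state_dict': model.state_dict(),
    'optimiser_state_dict': optimiser.state_dict(),
}, buffer)

# Resumed run: move the model to the device BEFORE constructing the
# optimiser and loading its state.
buffer.seek(0)
checkpoint = torch.load(buffer, map_location=device)
model = make_model()
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
optimiser.load_state_dict(checkpoint['optimiser_state_dict'])
first_epoch = checkpoint['epoch'] + 1

train_step(model, optimiser)  # Adam state and parameters are on the same device
```

The point of the ordering is that the optimiser's state is matched to whatever device the parameters are on when `load_state_dict` is called, so the parameters should already be in their final location by then.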


I have created a model class like this:

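import torch.nn as nn
from torchvision import models

class model_parallel(nn.Module):
   def __init__(self):
      super().__init__()  # must run before submodules are assigned
      sub_net_1 = models.resnet50(True)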
      # sub_net_1 = torch.nn.Sequential(*(list(sub_net_1.children())[:-3]))
      for param in sub_net_1.parameters():
         param.requires_grad = False
      sub_net_2 = nn.Sequential(nn.Linear(in_features=1000, out_features=500,bias=True),
      nn.Linear(in_features=500, out_features=100,bias=True),
      nn.Linear(in_features=100, out_features=67,bias=True))
      self.sub_network1 = sub_net_1.cuda(0)
      self.sub_network2 = sub_net_2.cuda(1)

   def forward(self, x):
      x = x.cuda(0)
      x = self.sub_network1(x)
      # print(x.shape)
      x = x.cuda(1)
      x = self.sub_network2(x)
      return x

However, I am getting

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_addmm)

at this line:

x = self.sub_network2(x)

What am I doing wrong here?

Your code works fine using:

model = model_parallel()
x = torch.randn(64, 3, 224, 224, device='cuda')
out = model(x)
print(out.shape)
> torch.Size([64, 67])
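If the error still shows up on your side, one thing worth checking (this is a guess, since the full script isn't shown) is whether something later calls `model.cuda()` or `model.to(device)` on the whole model. That would move both sub-networks onto one device while `forward` still sends `x` to `cuda:1`. A small sketch for inspecting where each sub-network lives; `TwoStage` and `parameter_devices` are hypothetical names, and the code falls back to CPU when fewer than two GPUs are present:

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev1, dev2 = torch.device('cuda:0'), torch.device('cuda:1')
else:
    dev1 = dev2 = torch.device('cpu')


class TwoStage(nn.Module):
    """Hypothetical, much smaller stand-in for model_parallel."""

    def __init__(self):
        super().__init__()
        self.sub_network1 = nn.Linear(8, 4).to(dev1)
        self.sub_network2 = nn.Linear(4, 2).to(dev2)

    def forward(self, x):
        x = self.sub_network1(x.to(dev1))
        # Move the activations to the second stage's device before using it.
        return self.sub_network2(x.to(dev2))


def parameter_devices(module):
    """Map each direct child module to the set of devices its parameters are on."""
    return {name: {str(p.device) for p in child.parameters()}
            for name, child in module.named_children()}


model = TwoStage()
# Do not call model.cuda() or model.to(...) here: that would put both
# sub-networks on a single device and undo the placement above.
report = parameter_devices(model)
out = model(torch.randn(3, 8))
print(report)
print(out.shape)
```

If the report shows both sub-networks on the same GPU right before the failing forward pass, the placement was changed somewhere after the model was constructed.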