Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Loading model from torchvision

```python
model = models.segmentation.deeplabv3_resnet50(pretrained=True, progress=False, num_classes=21, aux_loss=None)
model.classifier[4] = nn.Conv2d(256, 1, (1, 1), (1, 1))
model.aux_classifier[4] = nn.Conv2d(256, 1, (1, 1), (1, 1))
```

Freeze the initial layers for fine-tuning

```python
idx = 0
for name, param in model.named_parameters():
    # print(idx, name)
    if idx < 129:
        param.requires_grad = False  # freeze this parameter
    else:
        break
    idx += 1
```

Defining device as cuda if a GPU is available, else cpu

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```

Defining loss criterion

```python
criterion = FocalTverskyLoss(alpha=0.9)
```

Defining the optimizer to update the model params; Adam is a good default

```python
optimizer = optim.Adam(model.parameters())
```

Learning rate scheduler to update lr when loss stops improving

```python
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
```

Path at which to store model and required configs

```python
model_save_path = 'Models/'
```

If a checkpoint file exists, load the previous training state (model, optimizer, scheduler, etc.) to continue training; otherwise start from epoch 1

```python
if os.path.exists(model_save_path + 'deeplabv3_resnet50_train.pt'):
    checkpoint = torch.load(model_save_path + 'deeplabv3_resnet50_train.pt')
    start = checkpoint['epoch'] + 1
    min_val_loss = checkpoint['val_loss']
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
else:
    start = 1
    min_val_loss = 100

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
```

Move model to device

```python
model.to(device)
```

Total epochs to train for

```python
epochs = 100
```

Training loop

```python
for epoch in range(start, epochs + 1):
    train_loss = 0.
    val_loss = 0.

    # Switch model to training mode
    model.train()

    # Forward pass through the training dataset
    for imgs, masks in train_loader:
        # Move batch data to device
        imgs, masks = imgs.to(device), masks.to(device)

        # Clear previous gradients
        optimizer.zero_grad()

        # Forward pass the batch and get predictions
        preds = model(imgs)['out']

        # Calculate loss
        loss = criterion(preds, masks)

        # Accumulate the loss over the whole dataset
        train_loss += (loss.item() * imgs.size(0))

        # Backpropagate gradients
        loss.backward()

        # Make weight updates
        optimizer.step()

        # Empty cuda cache to free cached memory in VRAM
        torch.cuda.empty_cache()

    # Switch model to inference mode
    model.eval()

    # Forward pass through the validation dataset
    with torch.no_grad():
        for imgs, masks in val_loader:
            imgs, masks = imgs.to(device), masks.to(device)

            preds = model(imgs)['out']

            loss = criterion(preds, masks)
            val_loss += (loss.item() * imgs.size(0))

            torch.cuda.empty_cache()

    # Average loss over the dataset
    train_loss /= len(train_dataset)
    val_loss /= len(val_dataset)

    # Reduce lr if the validation loss is not improving
    scheduler.step(val_loss)

    # If the validation loss improved, store the model and training state
    if val_loss < min_val_loss:
        min_val_loss = val_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'val_loss': min_val_loss
            }, model_save_path + 'deeplabv3_resnet50_train.pt')
        torch.save(model, model_save_path + 'deeplabv3_resnet50_infer.pt')

    # Print epoch number, train loss, and validation loss
    print('Epoch {}:\tTrain Loss: {}\tVal Loss: {}'.format(epoch, train_loss, val_loss))
```

It gives the error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Could you post the complete error message so that we can use the stack trace to narrow down the line of code that raises the error?

You can post code snippets by wrapping them into three backticks ```, which makes debugging easier :wink:

```
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
     28
     29         # Make weight updates
---> 30         optimizer.step()
     31
     32         # Empty cuda cache to clear useless data from VRAM for better utilization

~/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     13     def decorate_context(*args, **kwargs):
     14         with self:
---> 15             return func(*args, **kwargs)
     16     return decorate_context
     17

~/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/optim/adam.py in step(self, closure)
     97
     98                 # Decay the first and second moment running average coefficient
---> 99                 exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    100                 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    101                 if amsgrad:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```

I have a very similar error. The line causing it is optimizer.step(), and it only seems to happen after loading from a checkpoint. I make sure to move the model to the proper device after loading; is there something else we need to call .to() on?

This worked for me: optimizer load_state_dict() problem? ¡ Issue #2830 ¡ pytorch/pytorch ¡ GitHub
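
As far as I understand, the workaround discussed in that issue amounts to moving the optimizer's restored state onto the same device as the model after calling optimizer.load_state_dict(...). A minimal sketch along those lines, assuming `device` and `optimizer` are the ones defined in the original post and that the optimizer is Adam (whose exp_avg/exp_avg_sq buffers are what end up on the CPU):

```python
# After optimizer.load_state_dict(checkpoint['optimizer_state_dict']),
# the restored per-parameter buffers may still live on the CPU while the
# model's parameters and gradients sit on the GPU, which is exactly what
# optimizer.step() complains about. Move them over explicitly:
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)
```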

I had the same error as you. I moved the model with model.cuda() first and only then called model.load_state_dict(...) and optimizer.load_state_dict(...). It works for me.
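
If it helps anyone: as far as I understand, the ordering matters because Optimizer.load_state_dict() casts the restored state tensors (Adam's exp_avg/exp_avg_sq) to the device of the corresponding parameters, while a later model.to(device) only moves the model's parameters and buffers, never the optimizer state. A minimal sketch of the working order, reusing `model`, `model_save_path`, and the checkpoint keys from the original post:

```python
# 1) Put the model on the target device *before* restoring any state
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# 2) Build the optimizer from the (now GPU-resident) parameters
optimizer = optim.Adam(model.parameters())

# 3) Restore the checkpoint; map_location keeps the loaded tensors off the CPU,
#    and load_state_dict places the optimizer state on the parameters' device
checkpoint = torch.load(model_save_path + 'deeplabv3_resnet50_train.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
```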

Thank you a lot. I had the same issue, and it works for me.

I love you bro, truly. Works well for me.

Thanks a lot, this solution works for me. But I still don’t understand why this is working :sweat_smile: