Automatic mixed precision - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)

I'm trying to use automatic mixed precision, but I get an error saying that not all tensors are on the GPU. When I check the model and all the variables, everything seems fine.
Any idea why it fails during the forward pass, and how to fix it?
(I cut out a lot of code to get the shortest example that reproduces the error.)
(torch 1.10.2, Python 3.6.9, T4 GPU)

import torch

import os
import torch.nn as nn
from torch.optim import SGD
from torchvision import transforms,models
from torchvision.datasets import ImageFolder
from torch.cuda.amp import GradScaler, autocast

dataset_path = '/my_source_path'  # under the path I have 'train' and 'test' folders

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('device: ', device)

train_transform = transforms.Compose([
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

train_dataset = ImageFolder(root=os.path.join(dataset_path, 'train'), transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, num_workers=2)

model = models.inception_v3(pretrained=True, aux_logits=False).to(device)
model.fc = nn.Linear(2048, 2)
print('model on cuda: ', next(model.parameters()).is_cuda)

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.005, momentum=0.9)
scaler = GradScaler()

for epoch in range(3):
    print('epoch: ', epoch+1)
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        if torch.cuda.is_available():
            # move the batch to the GPU
            inputs, labels = inputs.to(device), labels.to(device)
        print('inputs on cuda: ', inputs.is_cuda, ', type: ', inputs.dtype)
        print('labels on cuda: ', labels.is_cuda, ', type: ', labels.dtype)
        with autocast():
            output = model(inputs)
            print('output on cuda: ', output.is_cuda, ', type: ', output.dtype)
            loss = criterion(output, labels)

Your new module should still be on the CPU, since you are:

  • creating the inception_v3 model
  • pushing it to the GPU
  • replacing model.fc with a new nn.Linear, which is created on the CPU
  • not pushing the new layer to the device
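One way to confirm this is to group the model's parameters by device. Checking only next(model.parameters()), as in the question, looks at the first parameter and misses the replaced layer. The sketch below is only an illustration: TinyNet is a hypothetical stand-in for inception_v3 (so no weights are downloaded) and parameters_by_device is a helper name made up for this example.

```python
import torch
import torch.nn as nn


def parameters_by_device(model: nn.Module) -> dict:
    """Group parameter names by the device they currently live on."""
    groups = {}
    for name, param in model.named_parameters():
        groups.setdefault(str(param.device), []).append(name)
    return groups


class TinyNet(nn.Module):
    """Hypothetical stand-in for inception_v3: a body plus an fc head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Same order as in the question: move first, then replace the head.
model = TinyNet().to(device)
model.fc = nn.Linear(4, 2)  # created on the CPU, never moved

report = parameters_by_device(model)
print(report)
```

On a machine with a GPU the report should contain two keys, with fc.weight and fc.bias listed under cpu. On a CPU-only machine everything ends up under cpu, so the mismatch does not show up there.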

Thanks! I didn't know that the replacement module is created on the CPU. When I change the model I usually move it to the device right afterwards (I didn't do that here), but it's good to know that any replacement starts out on the CPU.

Yes, the reason for this is that:

model.fc = nn.Linear(2048, 2)

initializes the layer as a plain nn.Linear module, which uses the CPU as its default device, so you need to explicitly push the module to the GPU if needed.
The common approach is to manipulate your model first, then push the entire model to the GPU once.
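A minimal sketch of that order, assuming a small hypothetical TinyNet in place of inception_v3 so the example runs without downloading weights; the layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for inception_v3: a body plus an fc head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = TinyNet()
model.fc = nn.Linear(4, 2)  # 1. manipulate the model while it is on the CPU
model = model.to(device)    # 2. push the entire model to the device once

# Every parameter, including the new head, now lives on the same device.
devices = {p.device for p in model.parameters()}

inputs = torch.randn(5, 8, device=device)
output = model(inputs)
print(devices, output.shape)
```

If the model has already been moved, creating the replacement directly on the device with nn.Linear(2048, 2).to(device) should work as well.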
