Hi everyone,
I’m training a model using PyTorch and while running the train function I encounter the following error message:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
During the run, I noticed that after the first epoch my tensor's device changes from the GPU to the CPU.
Check if the data was properly moved to the GPU as this error indicates a device mismatch in the model execution while the parameters of the model seem to be on the GPU already.
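One way to run this check is to print the device of the model parameters next to the device of the input batch right before the forward pass. The snippet below is only a sketch: `describe_devices` is a hypothetical helper, and the `Conv2d` layer and random batch are stand-ins for the real model and data.

```python
import torch
from torch import nn


def describe_devices(model: nn.Module, batch: torch.Tensor) -> dict:
    """Return the device types of the model parameters and of an input batch."""
    param_devices = {p.device.type for p in model.parameters()}
    return {"parameters": param_devices, "input": batch.device.type}


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Conv2d(3, 8, kernel_size=3).to(device)  # stand-in for the real model
images = torch.randn(4, 3, 32, 32)                  # new tensors start on the CPU

print("before .to():", describe_devices(model, images))
images = images.to(device)                          # .to() returns a new tensor
print("after .to():", describe_devices(model, images))
```

If the two entries differ at the point where the error is raised, the batch reaching the model was not moved to the device that holds the weights.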
At each training iteration I’m moving both the data and the model to the GPU as in the attached code:
import torch
from time import sleep
from tqdm import tqdm

def train(num_epochs, model, optimizer, loss_fn, train_loader):
    best_accuracy = 0.0
    # Define your execution device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print("The model will be running on", device, "device")
    # Convert model parameters and buffers to CPU or Cuda
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        running_loss = 0.0
        running_acc = 0.0
        for i, (images, labels) in enumerate(tqdm(train_loader, 0)):
            model = model.to(torch.device('cuda'))
            # get the inputs
            images = images.to(torch.device('cuda'))
            labels = labels.to(torch.device('cuda'))
            # zero the parameter gradients
            optimizer.zero_grad()
            # predict classes using images from the training set
            outputs = model(images)
            # compute the loss based on model output and real labels
            loss = loss_fn(outputs, torch.max(labels, 1)[1])
            # backpropagate the loss
            loss.backward()
            # adjust parameters based on the calculated gradients
            optimizer.step()
            # Let's print statistics for every 1,000 images
            running_loss += loss.item()  # extract the loss value
            if i % 10 == 0:
                # print every 1000 (twice per epoch)
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 1000))
                # zero the loss
                running_loss = 0.0
            sleep(0.1)
        # Compute and print the average accuracy for this epoch when tested over all 10000 test images
        accuracy = test_accuracy(model, train_loader)
        print('For epoch', epoch + 1, 'the test accuracy over the whole test set is %d %%' % (accuracy))
        # we want to save the model if the accuracy is the best
        if accuracy > best_accuracy:
            save_model()
            best_accuracy = accuracy
Or should I do it earlier, while creating the Dataset?
You can remove the model.to() call from the DataLoader loop, as the model should be moved to the device once, before the training starts.
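A minimal sketch of that structure, with the model moved once and only the batches moved inside the loop, could look like the following. The `Linear` model, the random `TensorDataset`, and the hyperparameters are hypothetical stand-ins, not the code from the post.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the real model and dataset.
model = nn.Linear(10, 7).to(device)  # moved once, before training starts
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(32, 10), torch.randint(0, 7, (32,)))
train_loader = DataLoader(dataset, batch_size=8)

losses = []
for epoch in range(2):
    model.train()
    for images, labels in train_loader:
        # Only the batch is moved per iteration; the model stays where it is.
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
```

Using a single `device` object everywhere also keeps the loop runnable on a CPU-only machine, which a hard-coded `torch.device('cuda')` does not.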
Are you creating any tensors in the forward method without moving them to the GPU? Could you also check the validation or test loop and make sure the data is moved to the GPU there as well?
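To illustrate the first question: a tensor created inside `forward` lands on the CPU by default, even when the module itself lives on the GPU. The sketch below shows two common ways to avoid that. `ScaledNet`, its `scale` buffer, and the `offset` tensor are made up for illustration.

```python
import torch
from torch import nn


class ScaledNet(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 7)
        # Registered buffers are moved together with the module by model.to(device).
        self.register_buffer("scale", torch.tensor(2.0))

    def forward(self, x):
        # Create new tensors on the input's device instead of the default (CPU).
        offset = torch.zeros(x.shape[0], 7, device=x.device)
        return self.fc(x) * self.scale + offset


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ScaledNet().to(device)
out = model(torch.randn(4, 10, device=device))
print(out.shape, out.device)
```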
If you get stuck, could you post a minimal, executable code snippet reproducing the issue, please?
I will try to see if it works during test or validation. The line in which I get the error is in bold, and it occurs right as the first epoch is about to end. I tried moving my input to the GPU within the forward method, but it had no effect… I’m attaching my forward method with the whole model class:
I think I managed to find the problem: when passing the DataLoader to the accuracy function, I did not change the device of the data, which may cause the crash. Checking it now, will update shortly.
Thank you!
I found the problem. As I mentioned in another comment, it came from the DataLoader: I didn’t move its batches onto the GPU in the accuracy function. After doing so, I’m now getting the following error:
RuntimeError: The size of tensor a (64) must match the size of tensor b (7) at non-singleton dimension 1
The function:
def test_accuracy(model, test_loader):
    model.eval()
    acc = 0.0
    total = 0.0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            images = images.to(torch.device('cuda'))
            labels = labels.to(torch.device('cuda'))
            # run the model on the test set to predict labels
            outputs = model(images)
            # the label with the highest energy will be our prediction
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            acc += (predicted == labels).sum().item()
    # compute the accuracy over all test images
    acc = (100 * acc / total)
    return (acc)
I guess the error is raised in the accuracy calculation:
predicted = torch.randint(0, 10, (2, 64))
labels = torch.randint(0, 10, (2, 7))
(predicted == labels).sum().item()
# RuntimeError: The size of tensor a (64) must match the size of tensor b (7) at non-singleton dimension 1
so check the shapes of these tensors and make sure you can compare them.
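One possible cause, stated as an assumption: the training loop calls `torch.max(labels, 1)[1]`, which suggests the labels are one-hot encoded with 7 classes. In that case `predicted` would have shape `(64,)` and `labels` shape `(64, 7)`, and comparing them fails when the shapes are broadcast. A sketch of the comparison with both tensors reduced to class indices first, using random stand-in data:

```python
import torch
import torch.nn.functional as F

num_classes = 7   # assumption: matches the "7" in the error message
batch_size = 64   # assumption: matches the "64" in the error message

# Random stand-ins for the model outputs and the one-hot labels.
outputs = torch.randn(batch_size, num_classes)
targets = torch.randint(0, num_classes, (batch_size,))
labels = F.one_hot(targets, num_classes)   # shape (64, 7)

predicted = outputs.argmax(dim=1)          # shape (64,)
label_idx = labels.argmax(dim=1)           # shape (64,), back to class indices

correct = (predicted == label_idx).sum().item()
accuracy = 100.0 * correct / labels.size(0)
print(predicted.shape, label_idx.shape, accuracy)
```

If the labels turn out to be stored as class indices already, the `argmax` on `labels` is not needed and the shape mismatch has a different cause.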