Model seems to compute its "scores" on the CPU from tensors on the GPU: "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

I am training a sequence-to-sequence LSTM model. The problem is that during the training loop, the loss calculation fails with the error: “Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!”.

So it seems like computing the “scores” from the model generates a tensor on the CPU instead of the GPU. I can work around it by changing the line scores = model(data, targets) to scores = model(data, targets).to(device), but that seems like unnecessarily shuttling a tensor from the GPU to the CPU and then back to the GPU.

model = Seq2SeqPF(encoder_net, decoder_net).to(device)

load_from_checkpoint = False

if load_from_checkpoint:
    load_checkpoint(torch.load(os.path.join(CHECKPOINT_DIRECTORY, CHECKPOINT_NAME)), model, device)

for epoch in range(EPOCHS):
    print(f"Epoch: {epoch + 1}/{EPOCHS}")
    kbar = pkbar.Kbar(target=batches_per_epoch, width=8)

    if epoch % 5 == 0:
        checkpoint = {'state_dict': model.state_dict(),
                      'optimizer': optimizer.state_dict()}
        save_checkpoint(checkpoint,
                        CHECKPOINT_DIRECTORY,
                        CHECKPOINT_NAME)

    for batch_idx, (data, targets) in enumerate(train_loader):
        
        data = data.to(device=device)
        targets = targets.to(device=device)

        # forward pass and compute error
        scores = model(data, targets) 
        loss = criterion(scores, targets)    # <---GENERATES ERROR ABOUT CPU AND GPU

        # reset gradients and backpropagate the loss
        optimizer.zero_grad()
        loss.backward()

        # gradient descent step
        optimizer.step()

        kbar.update(batch_idx, values=[("loss", loss)])
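
(In case it matters, save_checkpoint and load_checkpoint are just thin wrappers around torch.save and load_state_dict, roughly along these lines; the exact helpers are not important to the question:)

import os
import torch

def save_checkpoint(state, directory, filename):
    # serialize the dict of model/optimizer state to disk
    torch.save(state, os.path.join(directory, filename))

def load_checkpoint(checkpoint, model, device):
    # restore the weights and make sure the model ends up on the target device
    model.load_state_dict(checkpoint['state_dict'])
    model.to(device)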

The code seems standard. I do push the model itself to the GPU device, and the encoder_net and decoder_net layers are also pushed to the GPU before the model that wraps them is moved to the GPU.
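
As a sanity check, something like this should list any model parameters or buffers that are still sitting on the CPU (a quick debugging sketch, not part of the training code):

# list any model parameters or buffers that are still on the CPU
cpu_leftovers = [name for name, p in model.named_parameters() if p.device.type == 'cpu']
cpu_leftovers += [name for name, b in model.named_buffers() if b.device.type == 'cpu']
print(cpu_leftovers if cpu_leftovers else "all model parameters/buffers are on the GPU")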

Any suggestions on the right way to handle this?

Could there be a missing criterion = criterion.cuda() somewhere?
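
For example, if the criterion is something like nn.CrossEntropyLoss built with a weight tensor (just a guess, since the criterion setup isn't shown in the post), that weight stays on the CPU unless the criterion itself is moved to the GPU:

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# hypothetical setup, since the actual criterion definition isn't shown:
# a weighted loss keeps its weight tensor on the CPU by default
class_weights = torch.tensor([1.0, 2.0, 0.5])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# moving the criterion onto the same device as the model (equivalent to
# criterion.cuda() on a GPU machine) keeps criterion(scores, targets) on the GPU
criterion = criterion.to(device)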

@eqy Oh, that makes sense. I did not realize I needed to explicitly move the criterion to the device where the loss is computed. Thanks for finding that for me, I appreciate it.