Model seems to compute its "scores" on the CPU from tensors on the GPU: "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

I am training a sequence-to-sequence LSTM model. The problem is that during the training loop, the loss calculation fails with the error: “Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!”.

So it seems like computing the “scores” from the model generates a tensor on the CPU instead of the GPU. I can work around it by changing the line scores = model(data, targets) to scores = model(data, targets).to(device), but that seems like unnecessarily shuttling a tensor from the GPU to the CPU and then back to the GPU.

model = Seq2SeqPF(encoder_net, decoder_net).to(device)

load_from_checkpoint = False

if load_from_checkpoint:
    load_checkpoint(torch.load(os.path.join(CHECKPOINT_DIRECTORY, CHECKPOINT_NAME)), model, device)

for epoch in range(EPOCHS):
    print(f"Epoch: {epoch + 1}/{EPOCHS}")
    kbar = pkbar.Kbar(target=batches_per_epoch, width=8)

    if epoch % 5 == 0:
        checkpoint = {'state_dict': model.state_dict(),
                      'optimizer': optimizer.state_dict()}
        save_checkpoint(checkpoint,
                        CHECKPOINT_DIRECTORY,
                        CHECKPOINT_NAME)

    for batch_idx, (data, targets) in enumerate(train_loader):
        
        data = data.to(device=device)
        targets = targets.to(device=device)

        # forward pass and compute error
        scores = model(data, targets) 
        loss = criterion(scores, targets)    # <---GENERATES ERROR ABOUT CPU AND GPU

        # reset gradients and backpropagate the loss
        optimizer.zero_grad()
        loss.backward()

        # gradient descent step
        optimizer.step()

        kbar.update(batch_idx, values=[("loss", loss)])
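
(In case it matters, save_checkpoint and load_checkpoint are just thin wrappers around torch.save and load_state_dict, roughly along these lines; the exact helpers are not important to the question:)

import os
import torch

def save_checkpoint(state, directory, filename):
    # serialize the dict of model/optimizer state to disk
    torch.save(state, os.path.join(directory, filename))

def load_checkpoint(checkpoint, model, device):
    # restore the weights and make sure the model ends up on the target device
    model.load_state_dict(checkpoint['state_dict'])
    model.to(device)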

The code seems standard. I do push the model itself to the GPU device, and the encoder_net and decoder_net layers are also pushed to the GPU before the model that wraps them is moved to the GPU.
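
As a sanity check, something like this should list any model parameters or buffers that are still sitting on the CPU (a quick debugging sketch, not part of the training code):

# list any model parameters or buffers that are still on the CPU
cpu_leftovers = [name for name, p in model.named_parameters() if p.device.type == 'cpu']
cpu_leftovers += [name for name, b in model.named_buffers() if b.device.type == 'cpu']
print(cpu_leftovers if cpu_leftovers else "all model parameters/buffers are on the GPU")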

Any suggestions on the right way to handle this?

Could there be a missing criterion = criterion.cuda() somewhere?
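
For example, if the criterion is something like nn.CrossEntropyLoss built with a weight tensor (just a guess, since the criterion setup isn't shown in the post), that weight stays on the CPU unless the criterion itself is moved to the GPU:

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# hypothetical setup, since the actual criterion definition isn't shown:
# a weighted loss keeps its weight tensor on the CPU by default
class_weights = torch.tensor([1.0, 2.0, 0.5])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# moving the criterion onto the same device as the model (equivalent to
# criterion.cuda() on a GPU machine) keeps criterion(scores, targets) on the GPU
criterion = criterion.to(device)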

@eqy Oh, that makes sense. I did not realize I needed to explicitly move the criterion to the device where the loss is computed. Thanks for finding that for me, I appreciate it.