Why is `criterion.cuda()` not needed when `model.cuda()` is?

When working on a GPU, we need to do something similar to:

if torch.cuda.is_available(): x = x.cuda()

where x can be a model or an input variable.

I was wondering why this is not done for loss criteria:

criterion = nn.MSELoss()

# why is the below line not implemented?
if torch.cuda.is_available(): criterion.cuda()

Does the criterion somehow infer whether or not to use cuda from the model?


Moving a module to CUDA actually means moving all of its parameters to CUDA.
Criteria generally don't have parameters, so it is not necessary to do it for them.
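To make this concrete, here is a minimal sketch, assuming a toy `nn.Linear` model (not from the original post). It counts the parameters and buffers of the model and of `nn.MSELoss`, then computes a loss on whatever device is available without ever moving the criterion.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # toy model, chosen only for illustration
criterion = nn.MSELoss()

n_model_params = len(list(model.parameters()))          # weight and bias
n_criterion_params = len(list(criterion.parameters()))  # nothing to move
n_criterion_buffers = len(list(criterion.buffers()))    # nothing to move

# Use the GPU if there is one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
x = torch.randn(8, 4, device=device)
target = torch.randn(8, 2, device=device)

# The criterion was never moved; the loss lands on the inputs' device.
loss = criterion(model(x), target)
print(n_model_params, n_criterion_params, n_criterion_buffers, loss.device)
```

Since the criterion holds no tensors of its own, the computation simply runs wherever its inputs live.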


Thank you.

So does that mean if I implement my own loss function which involves parameters, I should do criterion.cuda() to improve speed?


Yes, if it has parameters, you will need to call .cuda() on it whenever the input you give it is on the GPU.
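As a rough illustration, here is a sketch of a loss that does own a parameter. `WeightedMSELoss` and its `log_weight` parameter are hypothetical names made up for this example, not something from PyTorch or from this thread. The call to `.to(device)` does the same job as `.cuda()` but also works on a CPU-only machine.

```python
import torch
import torch.nn as nn


class WeightedMSELoss(nn.Module):
    """Hypothetical loss with one learnable weight per feature."""

    def __init__(self, num_features):
        super().__init__()
        # Registered as a parameter, so .to(device) / .cuda() moves it.
        self.log_weight = nn.Parameter(torch.zeros(num_features))

    def forward(self, pred, target):
        weight = torch.exp(self.log_weight)  # exp(0) = 1 at initialisation
        return (weight * (pred - target) ** 2).mean()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the criterion so its parameter sits on the same device as the inputs.
criterion = WeightedMSELoss(num_features=2).to(device)

pred = torch.ones(3, 2, device=device)
target = torch.zeros(3, 2, device=device)
loss = criterion(pred, target)
print(loss.item(), criterion.log_weight.device)
```

Without the move, the parameter would stay on the CPU while the inputs are on the GPU, and the multiplication would be expected to fail with a device-mismatch error, so this is about correctness rather than speed.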


Hey @albanD,

Apologies for the direct message.
If I am using a dynamically created tensor in the loss calculation, what is the recommended way to write the loss function so that CUDA calls are used efficiently? For example:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class KLDLoss(nn.Module):
    """KL-divergence loss between attention weight and uniform distribution"""
    def __init__(self):
        super(KLDLoss, self).__init__()
    def forward(self, attn_val, cluster):
        """
        Input - attention value = torch.tensor([0.05, 0.1, 0.05, 0.1, 0.05, 0.1, 0.05, 0.05,
                                0.1, 0.05, 0.1, 0.05, 0.1, 0.05])
                cluster = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
        Output - 0.0043
        """
        kld_loss = 0
        cluster = np.array(cluster)
        for cls in np.unique(cluster):
            # positions of the attention values that belong to this cluster
            index = np.where(cluster == cls)[0]
            if torch.cuda.is_available():
                # uniform target is created on the CPU, then copied to the GPU
                kld_loss += F.kl_div(F.log_softmax(attn_val[index], dim=0)[None],
                                torch.ones(len(index), 1)[None].cuda()/len(index),
                                reduction='batchmean')
            else:
                kld_loss += F.kl_div(F.log_softmax(attn_val[index], dim=0)[None],
                                torch.ones(len(index), 1)[None]/len(index),
                                reduction='batchmean')
        return kld_loss



Side note: you might want to use the built-in KL divergence; see torch.nn.functional in the PyTorch 1.8.0 documentation.

From your code, I think the main issue is that you're using numpy arrays (which cannot be on the GPU).
You might want to keep everything as Tensors, and when you create new Tensors, pass the device= kwarg. In this case you presumably want to match the device of the inputs, so use device=attn_val.device.
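One way to read this suggestion is sketched below. It is not necessarily what the poster had in mind: it assumes `attn_val` is a 1-D tensor, replaces the numpy calls with `torch.as_tensor`, `torch.unique` and a boolean mask, and creates the uniform target with `device=attn_val.device`. The target keeps the (1, n, 1) shape from the question so the result stays comparable; whether that broadcasting is what the loss should do is left open here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KLDLoss(nn.Module):
    """KL-divergence loss between attention weights and a uniform distribution."""

    def forward(self, attn_val, cluster):
        device = attn_val.device
        # Keep the cluster labels as a tensor on the same device as the input.
        cluster = torch.as_tensor(cluster, device=device)
        kld_loss = torch.zeros((), device=device)
        for cls in torch.unique(cluster):
            values = attn_val[cluster == cls]  # boolean mask instead of np.where
            n = values.numel()
            # Uniform target created directly on the input's device,
            # with the same (1, n, 1) shape as in the question.
            uniform = torch.ones(n, 1, device=device)[None] / n
            kld_loss = kld_loss + F.kl_div(
                F.log_softmax(values, dim=0)[None], uniform, reduction="batchmean"
            )
        return kld_loss


attn_val = torch.tensor([0.05, 0.1, 0.05, 0.1, 0.05, 0.1, 0.05,
                         0.05, 0.1, 0.05, 0.1, 0.05, 0.1, 0.05])
cluster = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
if torch.cuda.is_available():
    attn_val = attn_val.cuda()

loss = KLDLoss()(attn_val, cluster)
print(loss.item(), loss.device)
```

With this version there is no `torch.cuda.is_available()` branch inside `forward`: the same code runs on CPU or GPU depending only on where `attn_val` lives, and on the example input it should come out close to the 0.0043 quoted in the question.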
