Why is `criterion.cuda()` not needed when `model.cuda()` is?

nivter · May 3, 2018, 3:25am

When working on GPU, we need to do something similar to:

x.cuda()

where x can be a model or input variables.

I was wondering why this is not done for loss criteria.

criterion = nn.MSELoss()

# why is the below line not implemented?
if torch.cuda.is_available(): criterion.cuda()

Does the criterion somehow infer whether or not to use cuda from the model?

albanD · May 3, 2018, 12:41pm

Moving a module to CUDA really just moves all its parameters (and buffers) to CUDA.
Criteria generally don't have parameters, so it is not necessary to do it.
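To make this concrete, here is a minimal sketch that counts the tensors each module owns. The `nn.Linear(4, 1)` model and the tensor shapes are illustrative assumptions, not taken from the thread. It falls back to CPU when no GPU is present.

```python
import torch
import torch.nn as nn

# Illustrative model and criterion (assumed for this sketch).
model = nn.Linear(4, 1)
criterion = nn.MSELoss()

# Count the tensors that .cuda() / .to(device) would actually move.
n_model_tensors = len(list(model.parameters())) + len(list(model.buffers()))
n_criterion_tensors = len(list(criterion.parameters())) + len(list(criterion.buffers()))
print(n_model_tensors, n_criterion_tensors)  # 2 0 (weight and bias vs. nothing)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # moves weight and bias
x = torch.randn(8, 4, device=device)
y = torch.randn(8, 1, device=device)

# The criterion was never moved; it only operates on the tensors it is given.
loss = criterion(model(x), y)
```

Since `nn.MSELoss` holds no tensors of its own, calling `criterion.cuda()` on it would be harmless but would have nothing to move.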

nivter · May 7, 2018, 1:54am

Thank you.

So does that mean if I implement my own loss function which involves parameters, I should do criterion.cuda() to improve speed?

albanD · May 7, 2018, 8:43am

Yes, if it has parameters, you will need to call .cuda() on it whenever the input you give it is on the GPU.
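As a sketch of that case, the hypothetical `WeightedMSELoss` below owns a learnable per-output weight, so it has to live on the same device as its inputs. The class name, the `log_weight` parameter, and the shapes are all assumptions made for illustration. `.to(device)` is used instead of `.cuda()` so the same code also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn

class WeightedMSELoss(nn.Module):
    """Hypothetical loss with a learnable per-output weight (illustration only)."""

    def __init__(self, num_outputs):
        super().__init__()
        # Registered as a parameter, so .to(device) / .cuda() will move it.
        self.log_weight = nn.Parameter(torch.zeros(num_outputs))

    def forward(self, pred, target):
        # exp() keeps the effective weight positive.
        return (self.log_weight.exp() * (pred - target) ** 2).mean()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the criterion's parameter to the same device as the data.
criterion = WeightedMSELoss(num_outputs=3).to(device)

pred = torch.randn(5, 3, device=device)
target = torch.randn(5, 3, device=device)
loss = criterion(pred, target)
```

Note that the move is about correctness rather than speed: if the inputs are on the GPU and the parameter is left on the CPU, the forward pass fails with a device-mismatch error instead of just running slowly.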

yashsharma0906 · March 11, 2021, 3:55pm

Hey @albanD,

Apologies for the direct message.
If I am using a dynamically created tensor in the loss calculation, what would be the recommended approach to writing the loss function so as to optimize the use of CUDA calls? For example:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class KLDLoss(nn.Module):
    """KL-divergence loss between attention weight and uniform distribution"""
    def __init__(self):
        super(KLDLoss, self).__init__()
        
    def forward(self, attn_val, cluster):
        """
        Example:
          Input - attention value = torch.tensor([0.05, 0.1, 0.05, 0.1, 0.05, 0.1, 0.05, 0.05, 
                                0.1, 0.05, 0.1, 0.05, 0.1, 0.05])
                  cluster = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
          Output - 0.0043
        """
        
        kld_loss = 0
        cluster = np.array(cluster)
        for cls in np.unique(cluster):
            index = np.where(cluster==cls)[0]
            if torch.cuda.is_available():
                kld_loss += F.kl_div(F.log_softmax(attn_val[index], dim=0)[None],\
                                torch.ones(len(index), 1)[None].cuda()/len(index),\
                                reduction = 'batchmean')
            else:
                kld_loss += F.kl_div(F.log_softmax(attn_val[index], dim=0)[None],\
                                torch.ones(len(index), 1)[None]/len(index),\
                                reduction = 'batchmean')                
        return kld_loss

Thanks!

albanD · March 11, 2021, 4:11pm

Hey,

Side note: you might want to use the built-in KL divergence; see torch.nn.functional in the PyTorch 1.8.0 documentation.

From your code, I think the main issue is that you're using NumPy arrays (which cannot be on the GPU).
You might want to keep everything as Tensors, and when you create new Tensors, pass the device= kwarg. In this case you presumably want to match the device of the inputs, so use attn_val.device.
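A minimal sketch of that suggestion, not a definitive rewrite: the cluster labels become a tensor on `attn_val.device`, and the uniform target is created directly on that device, so the `torch.cuda.is_available()` branch disappears. One assumption to flag: the target here is shaped `(1, n)` to match the input, whereas the posted code builds a `(1, n, 1)` target, so the resulting values may differ from the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KLDLoss(nn.Module):
    """KL-divergence loss between attention weights and a uniform distribution."""

    def forward(self, attn_val, cluster):
        device = attn_val.device
        # Keep the labels as a tensor on the same device as the inputs.
        cluster = torch.as_tensor(cluster, device=device)
        kld_loss = attn_val.new_zeros(())  # scalar on attn_val's device and dtype
        for cls in torch.unique(cluster):
            group = attn_val[cluster == cls]  # boolean mask, no NumPy round trip
            n = group.numel()
            # Uniform target created directly on the right device.
            # Assumption: shape (1, n) to match the input's shape.
            target = torch.full((1, n), 1.0 / n, device=device, dtype=attn_val.dtype)
            kld_loss = kld_loss + F.kl_div(
                F.log_softmax(group, dim=0)[None], target, reduction="batchmean"
            )
        return kld_loss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = KLDLoss()  # no parameters, so it does not need to be moved

uniform_attn = torch.zeros(6, device=device)  # softmax of equal values is uniform
skewed_attn = torch.tensor([3.0, 0.0, 0.0, 0.0, 2.0, 0.0], device=device)
labels = [1, 1, 1, 2, 2, 2]

uniform_loss = criterion(uniform_attn, labels)
skewed_loss = criterion(skewed_attn, labels)
```

Because every new tensor takes its device from `attn_val`, the same module works unchanged on CPU and GPU inputs.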