Move the loss function to GPU

Hi everyone,

I have a question about “.cuda()”. In a PyTorch example, I saw code like this:

criterion = nn.CrossEntropyLoss().cuda()

In my code, I don’t do this, so I am wondering whether it is necessary to move the loss function to the GPU.



If your input tensor is a CUDA tensor, the loss function will run on the GPU.


In addition to what @royboy said, you need to push your criterion to the GPU if it’s stateful, i.e. if it has parameters or internal states.
Loss functions are usually purely functional, so this is not necessary.
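As a rough illustration of the point above, here is a minimal sketch showing that a criterion created without any tensor arguments holds nothing that `.cuda()` could move. The `device` fallback to the CPU is an assumption added so the snippet also runs on machines without a GPU.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

criterion = nn.CrossEntropyLoss()  # no weight argument, so no tensors are stored

# A criterion without parameters or buffers has nothing to move to the GPU.
num_params = len(list(criterion.parameters()))
num_buffers = len(list(criterion.buffers()))
print(num_params, num_buffers)

output = torch.randn(4, 3, device=device, requires_grad=True)
target = torch.randint(0, 3, (4,), device=device)

# The loss is computed on whatever device the inputs live on.
loss = criterion(output, target)
print(loss.device)
```

In this case calling `.cuda()` on the criterion is harmless but has no effect.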


criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()


@ptrblck Could you explain what you mean by ‘if the criterion is stateful, i.e. if it has some parameters or internal states’? I am not sure what “internal states” means.

A weight parameter could be seen as an internal state and would yield a device mismatch error if it stays on the CPU while the inputs are on the GPU.
Of course you might create the weight as a CUDA tensor in the first place, but you could also move the criterion to the device:

import torch
import torch.nn as nn

output = torch.randn(10, 10, requires_grad=True, device='cuda')
target = torch.randint(0, 10, (10,), device='cuda')

weight = torch.empty(10).uniform_(0, 1)  # created on the CPU
criterion = nn.CrossEntropyLoss(weight=weight)

loss = criterion(output, target) # error
> RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'weight' in call to _thnn_nll_loss_forward

criterion.cuda()  # moves the weight buffer to the GPU
loss = criterion(output, target) # works
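The same idea can be written in a device-agnostic way. This is only a sketch: the `device` variable and the CPU fallback are assumptions added here, not part of the example above.

```python
import torch
import torch.nn as nn

# Assumption: use the GPU when available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weight = torch.empty(10).uniform_(0, 1)  # created on the CPU
# .to(device) moves the criterion's weight tensor along with the module.
criterion = nn.CrossEntropyLoss(weight=weight).to(device)

output = torch.randn(10, 10, requires_grad=True, device=device)
target = torch.randint(0, 10, (10,), device=device)

loss = criterion(output, target)  # weight, output, and target share one device
loss.backward()
```

Using `.to(device)` instead of `.cuda()` keeps the script runnable on CPU-only machines.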

@ptrblck Right, but I do not understand why you passed the weight parameter to the loss function in criterion = nn.CrossEntropyLoss(weight=weight). I have never seen anyone pass a weight to a loss function.

The weight argument can be used to create a class weighting, as described in the docs of the criterion. It’s sometimes used to e.g. counter overfitting effects of training a model on imbalanced datasets.
Weighted loss functions are not new in deep learning and were already used in the “classical” machine learning domain.
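To make the class-weighting idea concrete, here is a small sketch. The label tensor is made up for illustration, and the inverse-frequency formula is just one common heuristic, not something prescribed by the docs.

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels: class 0 is common, class 2 is rare.
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
num_classes = 3

# Number of samples per class.
counts = torch.bincount(targets, minlength=num_classes).float()

# Assumed heuristic: weight each class by its inverse frequency,
# so mistakes on rare classes contribute more to the loss.
class_weights = counts.sum() / (num_classes * counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(len(targets), num_classes)  # stand-in for model outputs
loss = criterion(logits, targets)
print(class_weights, loss.item())
```

Here the rare class 2 gets the largest weight and the common class 0 the smallest.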

Right, so these are class weights. I am not sure I correctly understand the meaning of “internal state”. How would you define it, and does the internal state vary from module to module? In the source code for nn.Softmax(), I can see some lines about ‘state’, i.e.,

def __setstate__(self, state):
    if not hasattr(self, 'dim'):
        self.dim = None

Is this the internal state you are referring to? I also noticed that modules and functions inheriting from nn.Module are mostly moved to CUDA. Any thoughts?

By “internal states” I mean all class attributes in the modules.
In particular, buffers and parameters are of interest, as they need to be pushed to the appropriate device (e.g. the weight buffer in nn.CrossEntropyLoss).
The dim argument in nn.Softmax however will not be pushed to the device, as it’s a plain Python integer to specify the dimension the softmax is applied in.
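One way to check this yourself is to list a module's buffers and parameters. A small sketch (the variable names `weighted`, `softmax`, and `buffer_names` are only for illustration):

```python
import torch
import torch.nn as nn

weighted = nn.CrossEntropyLoss(weight=torch.ones(10))
softmax = nn.Softmax(dim=1)

# The class weights are stored as a buffer, so .cuda() / .to() will move them.
buffer_names = [name for name, _ in weighted.named_buffers()]
print(buffer_names)

# nn.Softmax stores no tensors at all; dim is a plain Python int.
print(list(softmax.buffers()), list(softmax.parameters()))
print(type(softmax.dim))
```

Anything that shows up in `named_buffers()` or `named_parameters()` has to end up on the same device as the inputs; plain Python attributes do not.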


I tried this and got no error:

input = torch.randn(10, 10, requires_grad=True, device='cuda')

criterion = nn.Softmax(dim = 1)

loss = criterion(input) # no error