The model predicts some outputs, which I detach from the computation graph and convert into a numpy array.

During subsequent calculations on this numpy array, I use argmax() to get the final result (for example, something like [[1,4,6,3]]). Let's call these the predictions.

I convert these predictions to a tensor and set the requires_grad parameter to True:

torch.tensor(predictions, requires_grad=True)

I then try to apply torch.nn.CrossEntropyLoss to this tensor and my ground truth.

When I run the code, I see the same loss for all the epochs and the model basically does not train.
For example:

Epoch : 1, Loss : 4.0
Epoch : 2, Loss : 4.0
Epoch : 3, Loss : 4.0
Epoch : 4, Loss : 4.0

Now, I know that argmax() is not differentiable and hence should not be used in the loss. However, I do some post-processing on my outputs (which involves the argmax() operation) before converting the result to a tensor and applying the CrossEntropy loss to it.
If this is still wrong, is there an alternative to using argmax()?
Could you please help me with this?

Hello! I understand that argmax is not differentiable and thus should not be used in the loss function. However, I still don’t get why it should be a problem when I am using it as a part of some independent operations and then using the results obtained in a fresh numpy array for calculating the loss. (I am sorry if this is not clear. I can elaborate on it if you want)
I also realize that Cross Entropy deals with logits and thus it's probably wrong to apply Cross Entropy loss in my use case.
However, even when I apply the L1Loss I still don’t see any change in my loss on an epoch to epoch basis.
Could you help me understand it a little better?
Also, do you have a potential alternative to the argmax function?

Consider the following model output: tensor([[3.4, 0.1, 0.1]]). This means the model is predicting class 0. If we ran argmax, we’d get 0. But suppose class 2 was the correct class. If we tried L1Loss on the class integers, we’d have a loss of 2.0. If class 1 were the correct class instead, the loss would be 1.0.

Why should the order of the classes make any difference for the amount of loss? It shouldn’t. If we shuffled the classes, along with the model output, the loss should be the same. But if using L1Loss on argmax values, you’re not going to get any sort of sensible loss.
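A minimal sketch of the numbers above, using the same example output. The variable names are only for illustration. It shows that L1Loss on argmax indices depends on how the classes happen to be numbered, and that the argmax result carries no gradient information at all.

```python
import torch

l1 = torch.nn.L1Loss()

logits = torch.tensor([[3.4, 0.1, 0.1]], requires_grad=True)
pred = logits.argmax(dim=1).float()   # tensor([0.]), an index, not a differentiable value

# Both targets are equally "wrong", yet the loss differs with the class numbering.
loss_if_target_2 = l1(pred, torch.tensor([2.0]))
loss_if_target_1 = l1(pred, torch.tensor([1.0]))
print(loss_if_target_2.item(), loss_if_target_1.item())  # 2.0 1.0

# argmax returns integer indices, so the result is not connected to the graph.
print(pred.requires_grad)  # False
```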

If you wanted to apply L1Loss, instead, you could change the targets to a one_hot vector. I.e. class 2 being tensor([[0.0, 0.0, 1.0]]) and calculate loss with respect to the raw model outputs. At least that will give you something more sensible. And this would be agnostic to changes in order.
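As a rough sketch of that idea, assuming three classes and the same example output as before: the target is turned into a one-hot vector and L1Loss is computed against the raw outputs. Permuting the classes in both tensors leaves the loss unchanged, and gradients reach the outputs.

```python
import torch
import torch.nn.functional as F

l1 = torch.nn.L1Loss()

# Stand-in for the raw model output (in practice this comes from the model).
outputs = torch.tensor([[3.4, 0.1, 0.1]], requires_grad=True)
target = torch.tensor([2])
one_hot = F.one_hot(target, num_classes=3).float()   # tensor([[0., 0., 1.]])

loss = l1(outputs, one_hot)

# Shuffle the class order in both tensors: the loss does not change.
perm = torch.tensor([2, 0, 1])
loss_shuffled = l1(outputs[:, perm], one_hot[:, perm])

loss.backward()
print(outputs.grad)   # non-None: the loss is differentiable w.r.t. the outputs
```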

You would then have a new set of issues, though. A big one is that we want the model to maximize the value at class position 2 and minimize the values everywhere else, as opposed to just targeting 0s and 1s.

That’s where CrossEntropyLoss comes in. It handles that for you behind the scenes. It expects the raw, unnormalized model outputs (logits) for each class, since it applies log-softmax internally, together with the target class indices.
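For comparison, here is a small sketch of torch.nn.CrossEntropyLoss on the same example. It assumes the raw scores are passed directly (the loss applies log-softmax internally) and the target is given as a class index. The gradient pushes the score of the correct class up and the others down.

```python
import torch

criterion = torch.nn.CrossEntropyLoss()

# Raw scores straight from the model, no argmax and no softmax applied.
logits = torch.tensor([[3.4, 0.1, 0.1]], requires_grad=True)
target = torch.tensor([2])            # class index, dtype int64

loss = criterion(logits, target)
loss.backward()

# d(loss)/d(logits) = softmax(logits) - one_hot(target)
print(logits.grad)
```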

A non-differentiable operation will break the computation graph and will thus not allow you to calculate the gradients for any parameter previously used in the forward pass. Setting the requires_grad attribute of the detached tensor to True won’t help, as it does not reattach this tensor to the computation graph.
The used loss function is irrelevant at this point, since the gradients w.r.t. the model parameters won’t even be calculated.
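A small sketch that reproduces the situation, with a hypothetical nn.Linear standing in for the real model. The re-created tensor is a fresh leaf, so backward() stops there and the model parameters never receive a gradient, which is why the loss stays constant.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)        # hypothetical stand-in for the real model
x = torch.randn(2, 4)

out = model(x)
preds_np = out.detach().numpy().argmax(axis=1)   # the graph is cut here
preds = torch.tensor(preds_np, dtype=torch.float32, requires_grad=True)

target = torch.tensor([1.0, 2.0])
loss = torch.nn.L1Loss()(preds, target)
loss.backward()

print(preds.is_leaf)        # True: a new leaf with no history
print(preds.grad)           # gradient w.r.t. the new tensor only
print(model.weight.grad)    # None: nothing flowed back to the model
```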

@ptrblck I think this behaviour is also the same when I manually detach a tensor? (which makes sense)
For example, if I have a tensor a and I do something like?

However, the question is still unanswered. @J_Johnson @ptrblck, can you please suggest a workaround for detaching, so that my computation graph is not altered?

It’s not possible and wouldn’t make sense, since detaching a tensor explicitly cuts the computation graph. If you want to keep the computation graph you shouldn’t call detach() on the tensor.
If you need to use a 3rd party library, such as numpy, you would need to implement the backward function manually via a custom autograd.Function.
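A minimal sketch of such a custom autograd.Function, assuming CPU tensors and using a row-wise dot product as the NumPy operation. The class name NumpyDot is made up for this example. The forward pass runs in NumPy, and the backward pass supplies the gradients by hand.

```python
import numpy as np
import torch

class NumpyDot(torch.autograd.Function):
    """Row-wise dot product computed in NumPy with a hand-written backward."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        # Leave PyTorch on purpose: autograd cannot see this computation.
        result = np.sum(a.detach().numpy() * b.detach().numpy(), axis=1)
        return torch.from_numpy(result)

    @staticmethod
    def backward(ctx, grad_output):
        a, b = ctx.saved_tensors
        # out_i = sum_j a_ij * b_ij, so d(out_i)/d(a_ij) = b_ij and vice versa.
        grad = grad_output.unsqueeze(1)
        return grad * b, grad * a

a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)

out = NumpyDot.apply(a, b)
out.sum().backward()
print(a.grad, b.grad)   # a.grad equals b, b.grad equals a
```

If the same operation exists in PyTorch, using it directly is simpler, since the backward pass then comes for free.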

For non-differentiable operations you could create a custom autograd.Function returning all zero gradients (or somehow come up with other gradients). However, I wouldn’t know how all-zero gradients would be beneficial.
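To illustrate why all-zero gradients are not helpful, here is a sketch of such a Function wrapped around argmax. The class name and the stand-in model are made up for this example. backward() now runs through the model, but every parameter gradient is zero, so an optimizer step would not change anything.

```python
import torch

class ArgmaxZeroGrad(torch.autograd.Function):
    """argmax in the forward pass, all-zero gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Cast to float so the output can take part in a differentiable loss.
        return x.argmax(dim=1).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return torch.zeros_like(x)

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)        # hypothetical stand-in for the real model
x = torch.randn(2, 4)

preds = ArgmaxZeroGrad.apply(model(x))
loss = torch.nn.L1Loss()(preds, torch.tensor([1.0, 2.0]))
loss.backward()

print(model.weight.grad)   # a tensor of zeros: the model still cannot learn
```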

Hello @J_Johnson @ptrblck, I was converting the tensors to numpy arrays to compute some dot products.
As the loss did not change, I replaced the numpy operations with torch.mul followed by torch.sum. I then save the result in a numpy array and convert it back into a tensor using torch.from_numpy(my_numpy_array).
I then apply torch.nn.BCEWithLogitsLoss.
However, I still do not see any change in the loss. It stays the same.

PyTorch can take the dot product. By handling it through torch, you can maintain the graph. I see no reason to change to numpy as that will break the graph.

If you want to use autograd, all operations must be handled through PyTorch.
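A sketch of what that could look like, with a hypothetical nn.Linear model and a made-up second operand named other. The row-wise dot product is computed with torch.mul and torch.sum and passed straight to BCEWithLogitsLoss, with no round trip through numpy, so the model parameters receive gradients.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)        # hypothetical stand-in for the real model
x = torch.randn(2, 4)
other = torch.randn(2, 3)            # hypothetical second operand of the dot product
target = torch.tensor([1.0, 0.0])

out = model(x)

# Row-wise dot product done entirely in PyTorch; the result keeps its grad_fn.
scores = torch.sum(torch.mul(out, other), dim=1)

loss = torch.nn.BCEWithLogitsLoss()(scores, target)
loss.backward()

print(scores.grad_fn is not None)   # True
print(model.weight.grad)            # populated, so an optimizer step can update the model
```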

This behavior is still expected as explained multiple times already.
If you detach a tensor from the computation graph (e.g. by transforming it to a numpy array) the previous operations and used parameters will not get any valid gradients computed from the final loss.
Re-creating a tensor via torch.from_numpy or the deprecated Variable will not change this behavior and the computation graph will still be cut.
In my previous post I’ve mentioned already that you should either avoid detaching the tensor from the computation graph by sticking to differentiable PyTorch operations or you would have to implement a custom autograd.Function.