One-hot preds: element 0 of tensors does not require grad and does not have a grad_fn

Hi there,

I’ve coded a model that takes a (batch_size, 8, 26) tensor of one-hot rows as input and yields a (batch_size, 8, 189) tensor.
My labels are also (batch_size, 8, 189) tensors of one-hot rows, so I wanted to turn preds = model(batch) into one-hot rows by transforming the max of each row into a one and the others into zeros.
That way I can determine whether the prediction is correct, not just see the drop in loss, which I can get now.

I tried with this:

    def MaxT1(self, batch_size, batch):
        # self.mphonsl, self.phons = 8, 189, because the inputs come from a fc layer
        batch = (batch == batch.max(dim=2, keepdim=True)[0]).view_as(batch)
        return batch

and calling it in the .forward of the model as:

        t = self.MaxT1(batch_size, batch)

However when calling loss.backward() I get

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-f2b8b1fbff2f> in <module>
     22         loss=loss(labels.float(),preds.float())
     23         optimizer.zero_grad() # otherwise it keeps all gradients
---> 24         loss.backward()
     25         optimizer.step()
     26         tb.add_scalar('Loss', loss, epoch)

~\.conda\envs\Pytorch\lib\site-packages\torch\ in backward(self, gradient, retain_graph, create_graph)
    148                 products. Defaults to ``False``.
--> 150         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    152     def register_hook(self, hook):

~\.conda\envs\Pytorch\lib\site-packages\torch\autograd\ in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Any idea how I could do this differently or fix the error?

Thanks in advance

Hi Amos!

You say “So I can determine if the prediction is correct.” I would
call this the accuracy – the fraction of predictions that are correct.

Then you say “not just the drop in loss, which I can get now.” From
this I assume that (when you don’t try to compute the accuracy)
you can compute your loss function, train your network to make
the loss smaller, and the loss does indeed go down.

In the standard approach, you train to reduce a loss function, and
then compute the accuracy independent of the training process.

(You typically compute the accuracy – or other statistics not directly
involved in the training process – inside a with torch.no_grad(): block.)
The idea is that you want your loss function:

  1. to measure not just whether your prediction is correct, but also
    how far off it is;

  2. and to be nicely differentiable so that you can get gradients that
    are useful for training (back-propagation).

Based on this, the accuracy generally doesn’t make a good loss
function, so you don’t want to use it for training.

If I understand correctly what you are trying to do, this standard
framework applies to your case. So you should train using your
loss function (as I am assuming that you have been able to do
successfully), and place your accuracy calculation in a
with torch.no_grad(): block.
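That pattern could be sketched like this (a minimal example; the Linear layer and the random inputs are just stand-ins for your model, with the (batch_size, 8, 26) and (batch_size, 8, 189) shapes from your post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(26, 189)        # stand-in for the real model
batch = torch.rand(4, 8, 26)      # stand-in for the one-hot inputs
labels = torch.zeros(4, 8, 189)
labels[..., 0] = 1.0              # fake one-hot labels for illustration

loss_fn = nn.MSELoss()
preds = model(batch)              # (batch_size, 8, 189)
loss = loss_fn(preds, labels)     # differentiable -- used for training
loss.backward()                   # works: preds has a grad_fn

with torch.no_grad():             # accuracy: kept out of the graph
    correct = (preds.argmax(dim=2) == labels.argmax(dim=2))
    accuracy = correct.float().mean().item()
print(accuracy)
```

The argmax comparison never enters the autograd graph, so it cannot interfere with loss.backward().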

Good luck.

K. Frank

Thanks KFrank,

Maybe I wasn’t clear enough.
I’m training on a loss with MSELoss between the last layer’s output and the labels.
I want to apply, after the last layer, something like the MaxT1 function I wrote, in order to keep only the max value of each row.

Something like:

Reshaped output from the last FCL

t1=tensor([[3.6600, 2.0000, 4.0900],
        [5.4400, 6.0000, 5.9000]])


l1=tensor([[0, 0, 1],
        [0, 0, 1]])

I’d like to get from the first tensor

t2=tensor([[0, 0, 1],
        [0, 1, 0]])

and then do MSELoss(l1,t2) instead of MSELoss(l1,t1) as I’m doing now, which is way higher. Also, by doing this it would be easier to check how many preds are correct.
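For reference, the t1 → t2 step can be sketched with argmax plus torch.nn.functional.one_hot (note that argmax itself is not differentiable, which is why it can’t sit between the model and the loss you backpropagate through):

```python
import torch
import torch.nn.functional as F

t1 = torch.tensor([[3.6600, 2.0000, 4.0900],
                   [5.4400, 6.0000, 5.9000]])
l1 = torch.tensor([[0, 0, 1],
                   [0, 0, 1]])

with torch.no_grad():  # keep this out of the training graph
    t2 = F.one_hot(t1.argmax(dim=1), num_classes=t1.size(1))
print(t2)
# tensor([[0, 0, 1],
#         [0, 1, 0]])

# count how many rows match the labels
n_correct = (t1.argmax(dim=1) == l1.argmax(dim=1)).sum().item()
```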

Hello Amos!

Have you tried putting the stuff that isn’t part of your actual
training – the stuff “like the MaxT1 function” – inside of a
with torch.no_grad(): block? You want to keep
calculations that don’t contribute to the loss you are training
on out of the gradients used for training so they don’t pollute
things or cause errors. That is the purpose of torch.no_grad().
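A tiny illustration of that purpose:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x * 2                 # recorded in the autograd graph
print(y.requires_grad)    # True

with torch.no_grad():
    z = x * 2             # not recorded
print(z.requires_grad)    # False
```

Anything computed on z stays outside the graph, so it can’t trigger autograd errors in the training path.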


K. Frank

Thanks K. Frank,

I managed to find a workaround using nn.Softmax(dim=2) instead of my own function.
They yield similar enough results.
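For completeness, a sketch of that workaround (reusing the example values from above): nn.Softmax(dim=2) is differentiable, so it can sit after the last layer and still allow loss.backward(); the output is a probability distribution per row rather than a hard one-hot, but the max entry dominates and the argmax is unchanged.

```python
import torch
import torch.nn as nn

out = torch.tensor([[[3.6600, 2.0000, 4.0900],
                     [5.4400, 6.0000, 5.9000]]], requires_grad=True)
l1 = torch.tensor([[[0., 0., 1.],
                    [0., 0., 1.]]])

probs = nn.Softmax(dim=2)(out)     # each row sums to 1, differentiable
print(probs.argmax(dim=2))         # same argmax as the raw outputs

loss = nn.MSELoss()(probs, l1)
loss.backward()                    # works: probs has a grad_fn
```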