Confused about torch.max() and gradient

import torch
from torch.autograd import Variable  # old, pre-0.4 API

x = Variable(torch.randn(1, 3), requires_grad=True)
z, _ = torch.max(x, 1)   # z is the max value along dim 1
z.backward()
print(x.grad)

Variable containing:
 1  0  0
[torch.FloatTensor of size 1x3]

I understand that max is not a differentiable operation. So why can I still get a gradient here?


max simply selects the greatest value and ignores the others, so max is the identity operation for that one element. Therefore the gradient can flow backwards through it for just that one element.
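
For illustration, here is a minimal sketch of the same behaviour with the current tensor API (no Variable wrapper); the input values are arbitrary, chosen so the argmax is the second element:

import torch

x = torch.tensor([[0.5, 2.0, -1.0]], requires_grad=True)
z, idx = torch.max(x, dim=1)   # z == 2.0, idx == 1
z.sum().backward()             # sum() just turns the 1-element result into a scalar
print(x.grad)                  # tensor([[0., 1., 0.]]) -- gradient only at the argmax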


Also, argmax is not continuous almost everywhere. But max is continuous everywhere.

z, y = torch.max(x, 1)

So is that the reason y doesn’t have a gradient function?

What do you mean by “argmax is not continuous almost everywhere. But max is continuous everywhere”?

Do you mean that we can do backpropagation through the max operation but not through the argmax operation?

To be precise, I should have said that argmax is not differentiable, but max is.


Is there a way to make the index have a gradient function?
i.e.

import torch
h = torch.randn(1, 2, 5, requires_grad=True); print(h)
val, idx = h.max(1, keepdim=True)   # max over dim 1: values and their indices
print(val)
print(idx)

The outputs are:

tensor([[[-0.5372, -0.4683,  0.4891, -0.1686, -0.4147],
         [-1.4412,  1.2837, -0.4467,  0.1731,  1.3256]]], requires_grad=True)
tensor([[[-0.5372,  1.2837,  0.4891,  0.1731,  1.3256]]],
       grad_fn=<MaxBackward0>)
tensor([[[0, 1, 0, 1, 1]]])

I want the tensor([[[0, 1, 0, 1, 1]]]) to have a gradient function.

It is mathematically not differentiable, so no.

In this paper, section 3.3, it says:

We first select Y frames (i.e. keyframes) based on the prediction scores from the decoder.

The decoder output is [2, 320], i.e. a non-keyframe score and a keyframe score for each of the 320 frames. We want to derive a 0/1 vector from the decoder output, but the process of [2, 320] → 0/1 vector does not seem to be differentiable…

How can this be implemented in PyTorch?

Thank you very much.
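
One common workaround (just a hedged sketch, not necessarily what the paper does) is to relax the hard selection during training: a softmax over the two scores gives a differentiable per-frame keyframe probability, while the hard 0/1 vector (via argmax) is kept for evaluation only. The shapes below follow the [2, 320] description above, and the loss is only a placeholder to show that gradients flow:

import torch
import torch.nn.functional as F

scores = torch.randn(2, 320, requires_grad=True)   # row 0: non-keyframe score, row 1: keyframe score

# Differentiable relaxation: per-frame probability of being a keyframe.
soft_keyframe = F.softmax(scores, dim=0)[1]         # shape [320], values in (0, 1), has a grad_fn

# Hard 0/1 vector for evaluation only -- argmax carries no gradient.
hard_keyframe = scores.argmax(dim=0)                # shape [320], dtype long, no grad_fn

loss = soft_keyframe.mean()                         # placeholder loss, just to show gradients flow
loss.backward()
print(scores.grad.shape)                            # torch.Size([2, 320])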

I guess the same can be said about the minimum operation?

Also, can you explain it a little bit more?
Why does the fact that max is the identity operation for the max element change the situation?
And if so, why was softmax invented?

Actually, softmax is more like softargmax… I have to say softmax is a terrible name.
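
To illustrate the naming point: softmax produces a differentiable distribution over positions, and a “soft-argmax” can be built from it by taking the expected index under that distribution. A minimal sketch (the temperature value here is just an assumed illustration, not something from this thread):

import torch

x = torch.tensor([0.2, 1.5, 0.1, -0.3], requires_grad=True)

# Hard argmax: an integer index, no gradient.
hard_idx = torch.argmax(x)                        # tensor(1)

# Soft-argmax: expected index under the softmax distribution -- differentiable.
temperature = 0.1                                 # smaller -> closer to the hard argmax
probs = torch.softmax(x / temperature, dim=0)
positions = torch.arange(x.numel(), dtype=x.dtype)
soft_idx = (probs * positions).sum()              # close to 1.0, and it has a grad_fn
soft_idx.backward()
print(hard_idx, soft_idx, x.grad)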
