Hi. I have a model where I will need to use one element from a matrix of elements. I computed a softmax probability distribution over this matrix. Now I would like to extract the element that corresponds to the maximum probability from the softmax, and then use that element further on.
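A minimal sketch of the setup described above (the original snippet is not shown, so names and shapes like `W`, `matrix`, and the `4, 5` size are assumptions):

```python
import torch

# Learnable parameter (name and shape are assumptions; original snippet not shown).
W = torch.randn(4, 5, requires_grad=True)
matrix = W * 2  # stand-in for some differentiable function of W

# Softmax probability distribution over all elements of the matrix.
probs = torch.softmax(matrix.flatten(), dim=0)

# Index of the maximum probability; argmax is not differentiable.
idx = torch.argmax(probs)

# Element corresponding to the maximum probability, used further downstream.
desired_element = matrix.flatten()[idx]
desired_element.backward()
```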
Gradients will flow back from desired_element to matrix, since taking one element out of a matrix is differentiable with respect to the matrix's values.
Gradients won't flow back towards the softmax though, as indexing is not differentiable with respect to the index, and the argmax operation is not differentiable either.
No, they won’t.
You can think about the gradients for idx as follows: how would a small change in idx change the output value? Well, idx is an index, so it’s discrete. So gradients don’t make sense here.
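This can be seen directly in autograd: the index returned by argmax is a plain integer tensor with no gradient history (a small illustrative sketch, with made-up tensor names):

```python
import torch

scores = torch.randn(5, requires_grad=True)
probs = torch.softmax(scores, dim=0)
idx = torch.argmax(probs)

# The returned index is a plain integer tensor: it has no autograd history,
# so nothing can backpropagate through it.
print(idx.dtype)          # torch.int64
print(idx.requires_grad)  # False
print(idx.grad_fn)        # None
```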
Thank you @albanD very much for your answer. What if I take a weighted sum of all the elements in matrix (using the softmax weights), but then multiplied those weights by a mask generated from the idx?
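One way the masked weighted sum could look (a sketch under the same assumed names as before; the original snippet is not shown):

```python
import torch

W = torch.randn(4, 5, requires_grad=True)
matrix = W * 2  # stand-in for a differentiable function of W

weights = torch.softmax(matrix.flatten(), dim=0)
idx = torch.argmax(weights)

# Hard one-hot mask built from the non-differentiable index.
mask = torch.zeros_like(weights)
mask[idx] = 1.0

# The mask zeroes out every entry except the argmax one; gradients reach W
# both through the selected matrix entry and through its softmax weight.
desired_element = (matrix.flatten() * weights * mask).sum()
desired_element.backward()
```

Note that the result is the selected entry multiplied by its softmax weight, not the raw entry itself.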
Doing this, you will get gradients back to your W, but it’s not exactly the same function, as you now multiply your entry by its softmax weight. Not sure how stable this function will be during training, though.
Also, in your example, I think you want the maximum value out of the max op, not the index of the maximum value.
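For reference, `torch.max` along a dimension returns both: the value (which is differentiable with respect to the input) and the index (which is not). A small sketch:

```python
import torch

matrix = torch.randn(4, 5, requires_grad=True)

# torch.max along a dimension returns both the value and its index.
values, indices = torch.max(matrix.flatten(), dim=0)

# The value is differentiable w.r.t. the matrix; the index is not.
values.backward()
```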
Hi @albanD, and thanks for the fast answers! I actually just want to take the maximum element from matrix and ignore everything else. But I still need the gradients to flow through W because I need it to learn how to choose the maximum.