I have a custom loss function:
def loss(p, y):
    fa = -np.dot(1 - y, np.log(1 - p))
    l = y * np.log(p)
    prod = -l
    loss = fa + np.min(prod)
    return loss
p - prediction
y - target
I have a multi-label problem, and I want the loss to be minimized even when the prediction was right in only one label (hence the minimum in my loss).
The problem is that this function is non-differentiable.
How can I use this function anyway?
Can you help me with this?
Why is this non-differentiable? You don’t seem to be using non-differentiable operations?
You'll want to reimplement it, though, so that it uses torch Tensors and only torch functions, so that it can use the autograd engine.
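As a rough sketch (this assumes the dot-product and min structure from your snippet is what you intended, so treat it as a starting point rather than a definitive translation):

```python
import torch

def loss(p, y):
    # Cross-entropy terms for the negative labels, summed via the dot product
    fa = -torch.dot(1 - y, torch.log(1 - p))
    # Per-label positive terms, negated so a better match means a smaller value
    prod = -(y * torch.log(p))
    # min over labels: only the best-matching label contributes
    return fa + torch.min(prod)

p = torch.tensor([0.9, 0.2, 0.4], requires_grad=True)
y = torch.tensor([1.0, 0.0, 1.0])
out = loss(p, y)
out.backward()  # autograd differentiates through torch.min without extra work
print(p.grad)
```

Once everything is a torch op, `.backward()` just works and you can plug the result into any optimizer.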
I thought that the minimum operation is not differentiable.
Isn't that true?
The argmin operation is non-differentiable (it returns an integer index). But the operation that returns the min value is differentiable: the gradient is just 1 for the value that was selected and 0 for all the others.
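You can check this directly with a small tensor:

```python
import torch

x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
m = torch.min(x)  # returns the min value (1.0), not an index
m.backward()
print(x.grad)  # tensor([0., 1., 0.]): gradient is 1 only for the selected element
```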
There’s a distinction here between mathematically differentiable and differentiable wrt. autograd’s internals, right?
That might be a small point of confusion.
I think this is actually the confusing point.
Can you explain a little bit more?
I can only hypothesize as to what autograd is doing in the background, but one can get a sense of why such a distinction exists by looking at a simpler example: relu.
One way to define relu is relu(x) = max(x, 0). This function isn't differentiable everywhere; however, at every point except 0, it is. In practice, for the purpose of gradient descent, it works well enough to treat the function as if it were differentiable. You'll rarely be computing the gradient at precisely 0, and even if you do, it's sufficient to handle things via a special case.
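For what it's worth, you can see that special case in PyTorch directly: at exactly 0 it just reports a gradient of 0, as if the left-hand branch applied there.

```python
import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
out = torch.relu(x).sum()
out.backward()
print(x.grad)  # tensor([0., 0., 1.]): the problematic point x = 0 gets gradient 0
```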
If you're okay with the behavior of relu, it's not too hard to generalize things to fit your min function. In general, you won't be doing computation at precisely the points that are problematic, and provided your underlying functions are differentiable, the max/min of the functions should be differentiable 'enough' to allow for optimization.
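A toy illustration of that generalization, using the min of two smooth functions (the names here are just for the example): away from the crossing point, the gradient is simply the gradient of whichever branch was selected.

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
# Elementwise min of two smooth functions; differentiable except where they cross
f = torch.min(x ** 2, (x - 1) ** 2)
f.backward()
print(x.grad)  # gradient of the selected branch x**2, i.e. 2 * 0.3 = 0.6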
There’s some fancier math you could throw in to explain why this works, and I’m sure I could be explaining things better, but that’s the intuition.