I have a custom loss function:
def loss(p, y):
    fa = -np.dot(1 - y, np.log(1 - p))
    l = y * np.log(p)
    prod = -l
    loss = fa + np.min(prod)
    return loss
p - prediction
y - target
I have a multi-label problem, and I want the loss to be minimized even when the prediction was right in only one label (hence the minimum in my loss).
The problem is that this function is non-differentiable.
How can I use this function anyway?
Can you help me with this?
Why is this non-differentiable? You don’t seem to be using non-differentiable operations?
You'll want to reimplement it, though, so that it uses torch Tensors and only torch functions, so that it can use the autograd engine.
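As a rough sketch (this assumes the dot-product and min structure from your snippet is what you intended, so treat it as a starting point rather than a definitive translation):

```python
import torch

def loss(p, y):
    # Cross-entropy terms for the negative labels, summed via the dot product
    fa = -torch.dot(1 - y, torch.log(1 - p))
    # Per-label positive terms, negated so a better match means a smaller value
    prod = -(y * torch.log(p))
    # min over labels: only the best-matching label contributes
    return fa + torch.min(prod)

p = torch.tensor([0.9, 0.2, 0.4], requires_grad=True)
y = torch.tensor([1.0, 0.0, 1.0])
out = loss(p, y)
out.backward()  # autograd differentiates through torch.min without extra work
print(p.grad)
```

Once everything is a torch op, `.backward()` just works and you can plug the result into any optimizer.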
I thought that the minimum operation is not differentiable.
Isn't that true?
The argmin operation is non-differentiable (it returns an integer index). But the operation that returns the min value is differentiable: the gradient is just 1 for the value that was selected and 0 for all the others.
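You can check this directly with a small tensor:

```python
import torch

x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
m = torch.min(x)  # returns the min value (1.0), not an index
m.backward()
print(x.grad)  # tensor([0., 1., 0.]): gradient is 1 only for the selected element
```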
There’s a distinction here between mathematically differentiable and differentiable wrt. autograd’s internals, right?
That might be a small point of confusion.
I think this is actually the confusing point.
Can you explain a little bit more?
I can only hypothesize as to what autograd is doing in the background, but one can get a sense of why such a distinction exists by looking at a simpler example: relu.
One way to define relu is relu(x) = max(x, 0). This function isn't differentiable everywhere; however, at every point except 0, it is. In practice, for the purpose of gradient descent, it works well enough to treat the function as if it were differentiable. You'll rarely be computing the gradient at precisely 0, and even if you do, it's sufficient to handle things via a special case.
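For what it's worth, you can see that special case in PyTorch directly: at exactly 0 it just reports a gradient of 0, as if the left-hand branch applied there.

```python
import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
out = torch.relu(x).sum()
out.backward()
print(x.grad)  # tensor([0., 0., 1.]): the problematic point x = 0 gets gradient 0
```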
If you're okay with the behavior of relu, it's not too hard to generalize things to fit your min function. In general, you won't be doing computation at precisely the points that are problematic, and provided your underlying functions are differentiable, the max/min of the functions should be differentiable 'enough' to allow for optimization.
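A toy illustration of that generalization, using the min of two smooth functions (the names here are just for the example): away from the crossing point, the gradient is simply the gradient of whichever branch was selected.

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
# Elementwise min of two smooth functions; differentiable except where they cross
f = torch.min(x ** 2, (x - 1) ** 2)
f.backward()
print(x.grad)  # gradient of the selected branch x**2, i.e. 2 * 0.3 = 0.6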
There’s some fancier math you could throw in to explain why this works, and I’m sure I could be explaining things better, but that’s the intuition.