Non-differentiable loss function in CNN

I can only hypothesize as to what autograd is doing in the background, but one can get a sense of why such a distinction exists by looking at a simpler example: relu.

One way to define relu is relu(x) = max(x, 0). This function isn’t differentiable everywhere; however, it is differentiable at every point except 0. In practice, for the purpose of gradient descent, it works well enough to treat the function as if it were differentiable. You’ll rarely be computing the gradient at precisely 0, and even if you do, it’s sufficient to handle that point as a special case (autograd just picks a value for the gradient there, e.g. 0).
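
Here’s a minimal sketch (assuming PyTorch, since the question is about autograd) showing that backprop through relu works fine even at the non-differentiable point x = 0; autograd simply uses 0 as the gradient there:

```python
import torch

# relu is only non-differentiable at exactly x = 0
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = torch.relu(x).sum()
y.backward()

print(x.grad)  # tensor([0., 0., 1.]) -- at x = 0 the gradient is defined to be 0
```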

If you’re okay with the behavior of relu, it’s not too hard to generalize things to fit your min function. In general, you won’t be doing computation at precisely the points that are problematic, and provided your underlying functions are differentiable, the max/min of those functions should be differentiable ‘enough’ to allow for optimization. A small sketch of this idea follows.
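
As a hypothetical example (again assuming PyTorch; the two losses below are made up for illustration): min of two differentiable functions is non-differentiable only where the branches cross, and autograd just routes the gradient through whichever branch attains the minimum, which is enough for gradient descent in practice.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss_a = (w - 1.0) ** 2           # differentiable branch A
loss_b = (w + 3.0) ** 2           # differentiable branch B
loss = torch.min(loss_a, loss_b)  # branch A wins here, since (2-1)^2 < (2+3)^2
loss.backward()

print(w.grad)  # tensor(2.) -- d/dw of (w - 1)^2 evaluated at w = 2
```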

There’s some fancier math (subgradients, for instance) you could throw in to explain why this works, and I’m sure I could be explaining things better, but that’s the intuition.