I’m not sure if this is the right place to ask this question but I’m going to ask this anyways because people on this forum are really helpful and intelligent!
I was reading about non-linearity functions. ReLU and its family (leaky ReLU, PReLU, etc.) are all discontinuous functions, yet they work really well with gradient-based optimization algorithms.
How does this work? Shouldn’t the non-linearity create a problem while calculating gradients?
The definition of continuity: the function f is continuous at some point c of its domain if the limit of f(x), as x approaches c through the domain of f, exists and is equal to f(c).
In detail this means three conditions. First, f has to be defined at c (guaranteed by the requirement that c is in the domain of f). Second, the limit of f(x) as x approaches c has to exist. Third, the value of this limit must equal f(c).
So ReLU and its variants are all continuous.
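If it helps, here is a small numeric sketch of those three conditions for ReLU at c = 0. It is only an illustration, not a proof: the `relu` helper and the step sizes are my own choices, and a finite list of steps can only suggest what the limit is.

```python
def relu(x):
    """ReLU: max(0, x). Defined for every real x, so condition one holds."""
    return max(0.0, x)

# Approach c = 0 from the left and from the right with shrinking steps.
c = 0.0
steps = [10.0 ** -k for k in range(1, 8)]    # 1e-1 down to 1e-7 (arbitrary choice)
left_values = [relu(c - h) for h in steps]   # negative inputs map to 0.0
right_values = [relu(c + h) for h in steps]  # positive inputs map to themselves

# Distance between f(c +/- h) and f(c) at the smallest step.
left_gap = abs(left_values[-1] - relu(c))
right_gap = abs(right_values[-1] - relu(c))
print(left_gap, right_gap)
```

Both gaps shrink toward zero as the step shrinks, which is what you expect when the left and right limits both equal f(0) = 0.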
Here is a great post by @rasbt.
In the end, I think you have confused non-linearity and discontinuity.
Actually, all we are looking for is non-linear activation functions that break the linearity between the W·x + b calculations in different layers.
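To make the "break linearity" point concrete, here is a minimal sketch with made-up 2x2 weights (all names and numbers are illustrative assumptions, nothing from a real model). Two stacked W·x + b layers with no activation collapse into a single layer with W = W2·W1 and b = W2·b1 + b2; putting ReLU in between prevents that collapse.

```python
def affine(W, b, x):
    """Return W @ x + b for a list-of-lists matrix W and list vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Elementwise ReLU on a list vector."""
    return [max(0.0, t) for t in v]

# Hypothetical weights and input, chosen only for illustration.
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, -1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [1.0, 0.0]
x = [1.0, 2.0]

# Two layers with no activation in between.
stacked = affine(W2, b2, affine(W1, b1, x))

# The single equivalent layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)
collapsed = affine(W, b, x)

# Same two layers, but with ReLU between them.
with_relu = affine(W2, b2, relu(affine(W1, b1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, while `with_relu` differs because the first layer produces a negative component that ReLU zeroes out.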
Yeah, sorry! I meant discontinuity there… I don't know why I wrote non-linearity.
It's OK, it happens to me all the time. But still remember that ReLU is continuous but has no derivative at x=0, based on the post I referenced.
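A quick way to see the missing derivative is to compare one-sided finite-difference slopes. This is a rough sketch under my own assumptions: the helper names and the step size are hypothetical, and the value returned at exactly x = 0 in `relu_grad` is a convention (any number between 0 and 1 is a valid subgradient; 0 is a common pick), not something the math forces.

```python
def relu(x):
    return max(0.0, x)

def one_sided_slopes(f, x, h=1e-6):
    """Finite-difference slopes of f at x from the left and from the right."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

def relu_grad(x, grad_at_zero=0.0):
    """Gradient used in backprop; the value at x == 0 is an assumed convention."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero

left0, right0 = one_sided_slopes(relu, 0.0)  # slopes disagree: 0.0 vs 1.0
left1, right1 = one_sided_slopes(relu, 1.0)  # slopes agree (both about 1)
print(left0, right0, left1, right1)
```

At x = 0 the two slopes disagree, so no single derivative exists there; away from 0 they agree, and gradient-based training simply uses the chosen value at the kink.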
Yeah! So because an input of exactly x=0 rarely comes up in a deep learning context, ReLU has turned out to be a better choice. Right?
Actually, I do not think that the derivative failing to exist at a countable number of points makes an activation better or affects performance, since the functions can also be implemented using lookup tables.