I’m not sure if this is the right place to ask this question but I’m going to ask this anyways because people on this forum are really helpful and intelligent!

I was reading about non-linear activation functions. ReLU and its family (leaky ReLU, PReLU, etc.) all have a discontinuous derivative at zero, yet they work really well with gradient-based optimization algorithms.
How does this work? Shouldn’t this discontinuity create a problem when calculating gradients?
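
To make the question concrete, here is a minimal sketch of gradient descent through a single ReLU in plain Python. The choice of 0 as the "derivative" at exactly z = 0 is an assumption (a common convention, any value in [0, 1] is a valid subgradient), and `fit_weight` and its default values are hypothetical names and numbers made up for this example.

```python
def relu(z: float) -> float:
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of ReLU for z != 0.

    At z == 0 the derivative does not exist; returning 0.0 there is an
    assumed convention, not a mathematical fact.
    """
    return 1.0 if z > 0 else 0.0


def fit_weight(x: float = 2.0, target: float = 3.0, w: float = 0.5,
               lr: float = 0.05, steps: int = 50) -> float:
    """Fit w so that relu(w * x) is close to target, using squared error."""
    for _ in range(steps):
        z = w * x
        error = relu(z) - target
        # Chain rule: d/dw (relu(w*x) - target)^2
        grad_w = 2.0 * error * relu_grad(z) * x
        w -= lr * grad_w
    return w
```

With the default values, `fit_weight()` should move w toward 1.5, since relu(1.5 * 2.0) equals the target 3.0. The pre-activation never lands exactly on zero in this run, so the convention at the kink is never even exercised.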

The definition of continuity: A function f is continuous at some point c of its domain if the limit of f(x), as x approaches c through the domain of f, exists and is equal to f(c).

In detail this means three conditions: first, f has to be defined at c (guaranteed by the requirement that c is in the domain of f). Second, the limit of f(x) as x approaches c has to exist. Third, the value of this limit must equal f(c).
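
As a rough numerical illustration of these three conditions (not a proof), the sketch below evaluates ReLU just to the left and right of c = 0 and also computes one-sided difference quotients there. The helper names `one_sided_values` and `one_sided_slopes` are made up for this example.

```python
def relu(x: float) -> float:
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def one_sided_values(f, c: float, steps=(1e-1, 1e-3, 1e-6)):
    """Return (f(c - h), f(c + h)) for each shrinking step h."""
    return [(f(c - h), f(c + h)) for h in steps]


def one_sided_slopes(f, c: float, h: float = 1e-6):
    """Return the left and right difference quotients of f at c."""
    left = (f(c) - f(c - h)) / h
    right = (f(c + h) - f(c)) / h
    return left, right


# Both one-sided values approach relu(0.0) == 0.0 as h shrinks,
# which is consistent with ReLU being continuous at 0.
values_near_zero = one_sided_values(relu, 0.0)

# The one-sided slopes disagree (0 from the left, 1 from the right),
# so it is the derivative, not the function, that jumps at 0.
left_slope, right_slope = one_sided_slopes(relu, 0.0)
```

So by the definition above ReLU passes the continuity check at 0; what fails there is differentiability.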

Actually, I do not think the derivative failing to exist at a countable number of points makes an activation function better or worse, or affects performance, since in practice these functions can even be implemented using lookup tables.
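
A minimal sketch of the lookup-table idea, assuming a uniform grid and a simple "nearest entry to the left" lookup with clamping at the ends. The function names, grid range, and table size are all hypothetical choices for illustration, not how any particular library does it.

```python
import bisect


def relu(x: float) -> float:
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def build_table(f, lo: float, hi: float, n: int):
    """Sample f at n evenly spaced points in [lo, hi]."""
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    return xs, ys


def lookup(xs, ys, x: float) -> float:
    """Return the stored value at the nearest grid point at or below x.

    Inputs outside the grid are clamped to the first or last entry.
    """
    i = bisect.bisect_right(xs, x) - 1
    i = min(max(i, 0), len(xs) - 1)
    return ys[i]


# Table of ReLU on the grid -4, -3, ..., 4 (assumed range and resolution).
xs, ys = build_table(relu, -4.0, 4.0, 9)
```

A table like this is a step function everywhere, so it has far more "bad" points than ReLU itself, which is the sense in which a single kink at zero seems unlikely to matter much.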