Gradient of ReLU at 0

What is the gradient of relu(x) = max(0, x) with respect to x when x = 0 in PyTorch?

Hi,

For relu at 0, we return a gradient of 0.
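
(A minimal way to check this yourself, assuming a recent PyTorch release; the expected output is noted in the comment:)

```python
import torch

# Minimal check of the gradient of relu at x = 0 (and on either side of it).
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.]) -> at x = 0 the returned gradient is 0
```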

Hello @albanD, I am interested in this question in 2023, but failed to find the source code for ReLU’s derivative at 0 in a reasonably short time. Could you point me to where in the source code I should look, please?

If possible, could you also point out where to look for what this article https://pytorch.org/docs/stable/notes/autograd.html says about the derivative behaviour when the function is (locally) convex/concave?

Hey!

Sure, it is indeed a bit tricky to track down: the backward formulas are declared in the autograd derivatives yaml files of the PyTorch source.
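
Roughly, what the registered backward formula computes amounts to the following (a Python sketch for illustration only, not the actual source; `relu_backward_sketch` is just a made-up name):

```python
import torch

def relu_backward_sketch(grad_output: torch.Tensor, result: torch.Tensor) -> torch.Tensor:
    """Illustrative only: pass gradients through where relu's output is > 0, else return 0."""
    # At x = 0 the output is 0, so (result > 0) is False there and the gradient is 0.
    return grad_output * (result > 0)

out = torch.relu(torch.tensor([-1.0, 0.0, 3.0]))
print(relu_backward_sketch(torch.ones(3), out))  # tensor([0., 0., 1.])
```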

For the doc, do you mean this paragraph in Autograd mechanics — PyTorch 2.1 documentation?
Note that relu, while it is nicely convex, is not continuously differentiable. So you are in a pretty bad case.


Hi @albanD,

First, thanks for helping point out the $ReLU’(0)$ source code. I hadn’t realized that I should have looked into these files. So, as far as I understand, is PyTorch simply defining $ReLU’(x)$ as

  • 0 when x <= 0;
  • dx/dx = 1 otherwise?

As for the doc, yes, I am referring to the Autograd mechanics link. In particular, the “Gradients for non-differentiable functions” section mentions the cases where the function in question is convex/concave. I had grepped for the keyword “convex” but failed to find results in the source code. After reading the yaml files you guided me to, I believe that the mentioned convexity principle is just guidance for both developers and users of PyTorch and that, to implement each new function, PyTorch developers must handle it case by case. In other words, there is no code examination of convexity to automatically follow the principle:

If the function is convex (at least locally), use the sub-gradient of minimum norm (it is the steepest descent direction).

Am I right?

P.S. One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of a real-valued function of one real variable, which usually is not the steepest descent.
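
(As a concrete illustration of that principle, consider $f(x) = |x|$: the subgradient set at $0$ is $[-1, 1]$, its minimum-norm element is $0$, and, assuming the standard behaviour of torch.abs, that is what autograd returns:)

```python
import torch

# |x| at 0: the subgradient set is [-1, 1]; the minimum-norm element is 0.
x = torch.tensor(0.0, requires_grad=True)
torch.abs(x).backward()
print(x.grad)  # tensor(0.)
```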

no code examination of convexity to automatically follow the principle

That is correct: these rules are implemented by hand in the derivative formulas that we have. Nothing automatic like that is done for backward-mode AD.
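
(To make “by hand” concrete, here is a toy sketch using torch.autograd.Function; the MyReLU class is only an illustration of how such a rule gets spelled out explicitly, not how the built-in relu is implemented:)

```python
import torch

class MyReLU(torch.autograd.Function):
    """Toy relu whose derivative rule is written out by hand."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Hand-written rule: pass the gradient through where x > 0, return 0 at x <= 0.
        return grad_output * (x > 0)

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
MyReLU.apply(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1.])
```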

One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of real-valued function of one real variable, which usually is not the steepest descent.

It is the flattest slope, but it still is the direction of steepest descent. Note that this is all about the direction of descent.
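
(For reference, the standard convex-analysis fact behind this, in notation not taken from the PyTorch docs: for a convex $f$, the directional derivative is $f'(x; d) = \max_{g \in \partial f(x)} \langle g, d \rangle$, and over directions with $\|d\| \le 1$ it is minimized by the negative of the minimum-norm subgradient:)

$$
d^* = -\frac{g^*}{\|g^*\|}, \qquad g^* = \operatorname*{arg\,min}_{g \in \partial f(x)} \|g\| \qquad (\text{when } g^* \neq 0).
$$

When $g^* = 0$, as for relu or $|x|$ at $0$, no descent direction exists at all, which is consistent with returning a zero gradient.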

OK @albanD, I see. Thanks for sharing your insights and your time.