What is the gradient of relu(x) = max(0, x) with respect to x when x = 0 in pytorch?
Hi,
For relu at 0, we return a gradient of 0.
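A quick check of this behaviour (assuming a standard PyTorch install):

```python
import torch

# Gradient of relu(x) = max(0, x) exactly at x = 0
x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x)
y.backward()
print(x.grad)  # tensor(0.)
```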
Hello @albanD, I am interested in this question in 2023 and failed to find the source code for ReLU's derivative at 0 in a reasonably short time. Could you point me to where in the source code I should read, please?
If possible, also where to read when this article https://pytorch.org/docs/stable/notes/autograd.html mentions the derivative behaviour when the function is (locally) convex/concave.
Hey!
Sure it is indeed a bit tricky to track down:
- You can find most of the formulas here: https://github.com/pytorch/pytorch/blob/341c4227a87b62c5e9b8fa919cb1a50baddd87cb/tools/autograd/derivatives.yaml#L2020-L2022 in particular, relu uses the same backward formula as threshold.
- This one is a native_function that you can find defined here: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L5937
- Grepping for the CPU implementation, you can find it at https://github.com/pytorch/pytorch/blob/341c4227a87b62c5e9b8fa919cb1a50baddd87cb/aten/src/ATen/native/Activation.cpp#L678-L680
- So it is basically calling threshold again with 0, 0 as arguments.
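The steps above can be summarized in a minimal Python sketch (not the real C++ kernel) of what `threshold_backward(grad, self, 0)` computes for relu:

```python
def relu_backward(grad_out: float, x: float) -> float:
    """Sketch of relu's backward: pass the incoming gradient through
    where x > 0, and return 0 elsewhere. Note that x == 0 falls into
    the "else" branch, which is why the gradient at 0 is 0."""
    return grad_out if x > 0 else 0.0
```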
For the doc, do you mean this paragraph in Autograd mechanics — PyTorch 2.1 documentation?
Note that relu, while it is nicely convex, is not continuously differentiable. So you are in a pretty bad case.
Hi @albanD,
First, thanks for helping point out the $ReLU'(0)$ source code. I hadn't realized that I should have looked into these files. So, as far as I could understand, is PyTorch simply defining $ReLU'(x)$ as
- 0 when x <= 0;
- dx/dx = 1 otherwise?
As for the doc, yes, I am referring to the Autograd mechanics link. In particular, in the "Gradients for non-differentiable functions" section, it mentions the cases where the function in question is convex/concave. I grepped for the keyword "convex" but found no results in the source code. After reading the yaml files you guided me to, I believe the mentioned convexity principle is just guidance for both developers and users of PyTorch, and that, to implement each new function, PyTorch developers must handle it case by case. In other words, there is no code examination of convexity to automatically follow the principle
If the function is convex (at least locally), use the sub-gradient of minimum norm (it is the steepest descent direction).
Am I right?
Ps. One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of real-valued function of one real variable, which usually is not the steepest descent.
no code examination of convexity to automatically follow the principle
That is correct, these rules are implemented by hand in the derivatives formula that we have. Nothing automatic like that is done for backward mode AD.
One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of real-valued function of one real variable, which usually is not the steepest descent.
It is the flattest slope, but it is still the steepest descent direction. Note that this is all about the direction of descent, not its magnitude.
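To make the minimum-norm choice concrete: the subgradient set of relu at 0 is the interval $[0, 1]$, and picking the element of smallest norm recovers exactly the value PyTorch returns. A small sketch (sampling the interval rather than solving analytically):

```python
# The subgradient set of relu at x = 0 is [0, 1]; any g in that
# interval satisfies relu(y) >= relu(0) + g * (y - 0) for all y.
subgradients = [i / 100 for i in range(101)]  # samples of [0, 1]

# The minimum-norm element of [0, 1] is 0, matching relu'(0) in PyTorch.
g_min_norm = min(subgradients, key=abs)
print(g_min_norm)  # 0.0
```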