What is the gradient of relu(x) = max(0, x) with respect to x when x = 0 in pytorch?
Hi,
For relu at 0, we return a gradient of 0.
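A quick check of this behaviour (assuming a standard PyTorch install):

```python
import torch

# Gradient of relu(x) = max(0, x) exactly at x = 0
x = torch.tensor(0.0, requires_grad=True)
y = torch.relu(x)
y.backward()
print(x.grad)  # tensor(0.)
```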
Hello @albanD, I am interested in this question in 2023 and failed to find the source code for ReLU's derivative at 0 in a reasonably short time. Could you point me to where in the source code I should read, please?
If possible, also where to read when this article https://pytorch.org/docs/stable/notes/autograd.html mentions the derivative behaviour when the function is (locally) convex/concave.
Hey!
Sure it is indeed a bit tricky to track down:
- You can find most of the formulas here: https://github.com/pytorch/pytorch/blob/341c4227a87b62c5e9b8fa919cb1a50baddd87cb/tools/autograd/derivatives.yaml#L2020-L2022 in particular, relu uses the same backward formula as threshold.
- This one is a native_function that you can find defined here: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L5937
- Grepping for the CPU implementation, you can find it at https://github.com/pytorch/pytorch/blob/341c4227a87b62c5e9b8fa919cb1a50baddd87cb/aten/src/ATen/native/Activation.cpp#L678-L680
- So it is basically calling threshold again with 0, 0 as arguments.
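The steps above can be summarized in a minimal Python sketch (not the real C++ kernel) of what `threshold_backward(grad, self, 0)` computes for relu:

```python
def relu_backward(grad_out: float, x: float) -> float:
    """Sketch of relu's backward: pass the incoming gradient through
    where x > 0, and return 0 elsewhere. Note that x == 0 falls into
    the "else" branch, which is why the gradient at 0 is 0."""
    return grad_out if x > 0 else 0.0
```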
For the doc, do you mean this paragraph in Autograd mechanics — PyTorch 2.1 documentation?
Note that relu, while it is nicely convex, is not continuously differentiable. So you are in a pretty bad case.
Hi @albanD,
First, thanks for helping point out the $ReLU'(0)$ source code. I hadn't realized that I should have looked into these files. So, as far as I could understand, is PyTorch simply defining $ReLU'(x)$ as
- 0 when x <= 0;
- dx/dx = 1 otherwise?
As for the doc, yes, I am referring to the Autograd mechanics link. In particular, in the "Gradients for non-differentiable functions" section, it mentions the cases where the function in question is convex/concave. I grepped for the keyword "convex" but found no results in the source code. After reading the yaml files you guided me to, I believe the mentioned convexity principle is just guidance for both developers and users of PyTorch, and that, to implement each new function, PyTorch developers must handle it case by case. In other words, there is no code examination of convexity to automatically follow the principle
If the function is convex (at least locally), use the sub-gradient of minimum norm (it is the steepest descent direction).
Am I right?
Ps. One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of real-valued function of one real variable, which usually is not the steepest descent.
no code examination of convexity to automatically follow the principle
That is correct, these rules are implemented by hand in the derivatives formula that we have. Nothing automatic like that is done for backward mode AD.
One small detail also confuses me: “the sub-gradient of minimum norm” would be the flattest slope in the case of real-valued function of one real variable, which usually is not the steepest descent.
It is the flattest slope, but it is still the steepest descent direction. Note that this is all about the direction of descent, not its magnitude.
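To make the minimum-norm choice concrete: the subgradient set of relu at 0 is the interval $[0, 1]$, and picking the element of smallest norm recovers exactly the value PyTorch returns. A small sketch (sampling the interval rather than solving analytically):

```python
# The subgradient set of relu at x = 0 is [0, 1]; any g in that
# interval satisfies relu(y) >= relu(0) + g * (y - 0) for all y.
subgradients = [i / 100 for i in range(101)]  # samples of [0, 1]

# The minimum-norm element of [0, 1] is 0, matching relu'(0) in PyTorch.
g_min_norm = min(subgradients, key=abs)
print(g_min_norm)  # 0.0
```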