Custom Functions are stateless, so defining the __init__() function does nothing. Only the static forward/backward functions are used.
You can create your Tensor directly inside the forward() function and save it on the ctx as ctx.a = a if you want to have it for the backward as well.

I am just trying to avoid creating the tensor every time forward is called, and to avoid passing it by value in the forward call, as I don’t want it to be part of backpropagation.

One more thing I don’t understand about the ReLU backward function is how the loss is backpropagated using the chain rule dL/dw = dL/d(relu_out) * d(relu_out)/d(relu_in) * d(relu_in)/dw.

As per my understanding, the 1st term on the RHS is “grad_output”, which is 1; the 2nd term is “grad_input”, which is also 1; and the 3rd term comes from “relu_input”: let’s say it is (w * x), then its derivative with respect to w is just x.

So from the chain rule the answer is 1 * 1 * x = x, which is multiplied by the learning rate, and the weight “w” changes accordingly.

Now the question is: if the loss value is 0.2 or changes to 0.8, how does it take part in changing the weight value? In both cases the passed values of grad_input and grad_output are 1.

Please correct me if I misunderstood something.

I am just trying to avoid creating the tensor every time forward is called, and to avoid passing it by value in the forward call, as I don’t want it to be part of backpropagation.

If you really want to do that, you can just have it as a global in your Python file and re-use it.
But I would advise against doing that as globals are quite dangerous in general!
Recreating it in the forward shouldn’t cost you much unless you call the function thousands of times in every forward pass through your net.

“grad_output” which is 1

That is not true in general: it is whatever gradient the other layers (the lower layers) computed.

“grad_input” which is also 1

Again, that depends on the other layers (the ones above you this time).
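
To make this concrete, here is a minimal pure-Python sketch of what a ReLU backward does for a single scalar. The name relu_backward and the sample values are made up for illustration; this is not the code from any particular repository, nor PyTorch's own implementation.

```python
def relu_backward(grad_output, relu_input):
    """Gradient w.r.t. the ReLU input for one scalar (hypothetical helper)."""
    # d(relu_out)/d(relu_in) is 1 where the input was positive, 0 elsewhere.
    local_grad = 1.0 if relu_input > 0 else 0.0
    # Chain rule: scale the local slope by whatever gradient was passed in.
    return grad_output * local_grad


# The same ReLU input gives different grad_input values
# depending on what the loss (or the layers after the ReLU) sent back.
grad_from_l1_loss = relu_backward(grad_output=-1.0, relu_input=3.0)
grad_from_other_loss = relu_backward(grad_output=0.4, relu_input=3.0)
grad_for_negative_input = relu_backward(grad_output=0.4, relu_input=-3.0)
print(grad_from_l1_loss, grad_from_other_loss, grad_for_negative_input)
```

So grad_output is only 1 when the loss happens to have slope 1 at that point, and grad_input is only 1 when, in addition, the ReLU input was positive.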

Let’s say the input is given to a convolution layer and the output of the convolution layer is given to a ReLU activation function. The loss is calculated from the output of the ReLU. Assume an L1 loss and let’s say the loss value is 0.2.

So for ReLU, grad_output comes from the L1 loss (the lower layer as per your reply, I think) and grad_input is given to the convolution layer (the upper layer). As per the autograd function of ReLU (https://github.com/DingKe/pytorch_workplace/blob/master/basic/functions.py), “grad_output” (dL/d(relu_out)) is always 1, as the ReLU function is f(x) = x and L1_loss = f(x) - target, so the derivative with respect to relu_out, which is f(x), becomes 1. That is why I think the “grad_output” passed to the backward function of ReLU is 1.

As per the backward function of ReLU, the return value of the function is “grad_input”, which is always 1.

If the output of the convolution layer is w*x, which is the input relu_in, then d(relu_in)/dw will be x.

So dL/dw = dL/d(relu_out) * d(relu_out)/d(relu_in) * d(relu_in)/dw = 1 * 1 * x = x.

The weight update will be w_new = w - learning_rate * dL/dw.

So the question is, how has the loss contributed to the change in weight? I hope I have clarified my understanding. Please let me know what I misunderstood if I am wrong.

Thanks for the details.
So indeed for that particular net of conv + relu + l1, you get this result (I didn’t dive into the calculation but that sounds plausible).

So the question is, how has the loss contributed to the change in weight?

The actual value of the loss does not in this case. If you look at the overall function you wrote, it is actually linear at that point with slope x. So the only thing that influences the direction in which you should move is x.
The simplest analogy here is a 1D identity function: whatever the current point, the gradient is always 1 (the slope x in your example), so you should move to lower values. The actual value of the function does not change anything.

You can try to do the same computation with an l2 loss (that is not linear) and you will see that the value of the loss will start appearing in the final result.
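
As a rough sketch of that comparison, the snippet below computes dL/dw by hand for both losses on a single scalar. It assumes w * x is positive so the ReLU acts as the identity; the helper names grad_l1 and grad_l2 and the numbers are invented for the example.

```python
def grad_l1(w, x, target):
    """dL/dw for L = |w*x - target|, assuming w*x > 0 (ReLU is the identity)."""
    diff = w * x - target
    sign = (diff > 0) - (diff < 0)  # +1, -1 or 0
    return sign * x


def grad_l2(w, x, target):
    """dL/dw for L = (w*x - target)**2 under the same assumption."""
    diff = w * x - target
    return 2.0 * diff * x


x, target = 2.0, 1.0
for w in (1.0, 3.0):
    diff = w * x - target
    print("w =", w,
          "| l1 loss", abs(diff), "grad", grad_l1(w, x, target),
          "| l2 loss", diff ** 2, "grad", grad_l2(w, x, target))
```

With the l1 loss the gradient is the same at both weights even though the loss differs, while with the l2 loss the gradient grows with the error.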

Just going one level deeper. Let’s say I want to predict y = 2*x, with initial weight w = 1, which we want to become 2 since y_pred = w * x. Let’s say we give the input x = 2 every time and learning_rate = 0.1, so the weight after step (1), according to

w_new = w - learning_rate * dL/dw

is 0.8 (1 - 0.1 * 2). The same input x = 2 is given again, so in step (2), w_new = 0.6 (0.8 - 0.1 * 2). For the same input x = 2, in step (3), w_new = 0.4 (0.6 - 0.1 * 2). Even if you change x to any positive value, w_new keeps decreasing and never converges to 2, and so neither does the network.

Does it mean we should never take the ReLU output as our final output? Or did I miss something?

I’m sorry I can’t really follow this.
What would be the pair (x, expected_y) that you want?
What would be the initial weights w0?

Does it mean we should never take the ReLU output as our final output?

It is not advised in general, no. Mainly because it does not have any benefit (if your value is positive, it will just be an identity) and it has the drawback of potentially getting your training stuck if the value becomes negative (as you will get 0 gradient and nothing will move anymore).
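
The "stuck" case can be sketched in a few lines of plain Python. This assumes a single scalar weight, an l1 loss on the ReLU output, and made-up values (w = -1, x = 2, target = 4); train_step is a hypothetical helper, not a library function.

```python
def relu(v):
    return v if v > 0 else 0.0


def train_step(w, x, target, lr):
    """One gradient step on L = |relu(w * x) - target| (hypothetical helper)."""
    relu_in = w * x
    pred = relu(relu_in)
    diff = pred - target
    grad_loss = (diff > 0) - (diff < 0)      # d|diff|/d(pred): +1, -1 or 0
    grad_relu = 1.0 if relu_in > 0 else 0.0  # 0 when the ReLU input is negative
    grad_w = grad_loss * grad_relu * x
    return w - lr * grad_w


w = -1.0  # w * x is negative, so the ReLU outputs 0
for _ in range(5):
    w = train_step(w, x=2.0, target=4.0, lr=0.1)
print(w)
```

The weight is unchanged after every step because the ReLU passes back a zero gradient, no matter how large the loss is.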

Or did I miss something?

Be careful here: for the l1 loss, the gradient can be 1 or -1 depending on the sign of the difference! That makes sure you don’t keep going in one direction forever.
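
Here is a small sketch of that sign at work, reusing the numbers from the question (initial w = 1, x = 2, learning rate 0.1) and assuming the target is y = 2 * x = 4. The helper l1_relu_step is hypothetical and handles one scalar only.

```python
def l1_relu_step(w, x, target, lr):
    """One gradient step on L = |relu(w * x) - target| (hypothetical helper)."""
    relu_in = w * x
    pred = relu_in if relu_in > 0 else 0.0
    diff = pred - target
    grad_loss = (diff > 0) - (diff < 0)      # sign of the difference: +1, -1 or 0
    grad_relu = 1.0 if relu_in > 0 else 0.0
    return w - lr * grad_loss * grad_relu * x


# Prediction below the target: the sign is -1, so w increases each step.
w = 1.0
history = [w]
for _ in range(5):
    w = l1_relu_step(w, x=2.0, target=4.0, lr=0.1)
    history.append(w)
print(history)

# Prediction above the target: the sign is +1, so w decreases.
w_from_above = l1_relu_step(3.0, x=2.0, target=4.0, lr=0.1)
print(w_from_above)
```

Starting from w = 1 the weight moves up towards 2 rather than down to 0.8, 0.6, 0.4, because the prediction is below the target and the l1 gradient is -1 there.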