Doubt in beginner tutorial regarding chain rule and manual gradient calculation

I’m going through this tutorial.
In the “Warm-up: numpy” section, the derivative of the loss with respect to h is calculated as follows:

    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0

But according to the chain rule, shouldn’t we do something like the following?


    '''
    Derivative of loss function wrt h = derivative of loss function wrt h_relu *
                                        derivative of h_relu wrt h
    '''
    def reluGrad(x):
        grad = x.copy()
        grad[x > 0] = 1
        return grad

    grad_h_relu_wrt_h = reluGrad(h_relu)
    grad_h = grad_h_relu.dot(grad_h_relu_wrt_h.T)

I know something is fishy here because the dimensions of grad_h on the LHS, which must equal the dimensions of h, are not going to match the dimensions of grad_h_relu.dot(grad_h_relu_wrt_h.T) on the RHS (since grad_h_relu and grad_h_relu_wrt_h have the same dimensions, we always end up with a square matrix on the RHS). But I am unable to figure out why my version is wrong and how the calculation mentioned in the tutorial leads to the correct gradient.

Did I misunderstand the tutorial code?
Could someone please explain this?

If I remember correctly, computing the gradient of relu is a bit tricky precisely because of the problem you’re mentioning here: the shapes don’t match. Let’s say z is what’s computed before relu, with shape 1x1000, and h is the result of passing it through relu. When we compute the gradient of h w.r.t. z (dh/dz), you have (in theory) to form the entire Jacobian matrix.

This is because, in theory, every value of the input (1x1000) could have influenced every value of the resulting (1x1000) vector. But since relu is applied element-wise, that is not actually the case: the only non-zero values sit on the diagonal of the Jacobian (1 where z > 0 and 0 otherwise), and everything off the diagonal is zero. Forming the Jacobian is thus a huge waste of computation; in this example it would be a 1000x1000 matrix that we would need to matrix-multiply with the gradient flowing back. So if we do form the Jacobian, we have 1x1000 mm 1000x1000 and the shapes match.

You can simplify this and do as the tutorial does by element-wise multiplying with the diagonal values directly, so in our case 1x1000 * 1x1000. This is equivalent to doing grad_h[h < 0] = 0.
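
Something like this small numpy sketch shows the equivalence (variable names are mine, not from the tutorial): it forms the full diagonal Jacobian once just to check that the matrix product gives the same result as the element-wise shortcut.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal((1, 1000))        # pre-relu activations
    h = np.maximum(z, 0)                      # h = relu(z)
    grad_h = rng.standard_normal((1, 1000))   # gradient flowing back, dL/dh

    # Full Jacobian dh/dz: a 1000x1000 diagonal matrix, 1 where z > 0 and 0 otherwise.
    jacobian = np.diag((z[0] > 0).astype(z.dtype))

    grad_z_full = grad_h.dot(jacobian)            # 1x1000 mm 1000x1000 -> 1x1000
    grad_z_fast = grad_h * (z > 0)                # element-wise multiply with the diagonal values
    print(np.allclose(grad_z_full, grad_z_fast))  # True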


Hey,

I think the confusion comes from what the backward pass computes.
If your function is y = relu(x) and you have a loss L, then the backward pass computes dL/dx = dL/dy * dy/dx, where dy/dx is the Jacobian of the relu function. Keep in mind that this is a matrix product with the gradient flowing back.

The Jacobian of the relu is very close to an identity matrix: it has 1 on the diagonal for positive inputs and 0 on the diagonal for negative inputs.
So the matrix product effectively zeroes out the entries of the gradient flowing back where the input was negative, and keeps them as-is where the input was positive.

In the tutorial code,

    grad_h = grad_h_relu.copy()

is there only to make sure you don’t modify the gradient flowing back in place (in case it is used somewhere else).

    grad_h[h < 0] = 0

zeroes out the entries corresponding to inputs that were negative.
Note that this single op is the same as doing the matrix product from the chain rule.

In your code sample, grad = x.copy() does not look right: x should be the input to the forward pass, while grad should be the gradient flowing back (the input of the backward function).
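
To make that concrete, a minimal sketch (my own function name, not from the tutorial) of a relu backward that takes both the gradient flowing back and the forward input could look like this:

    import numpy as np

    def relu_backward(grad_output, x):
        """Backward of y = relu(x): grad_output is dL/dy, x is the forward input."""
        grad_input = grad_output.copy()  # copy so the gradient flowing back is not modified in place
        grad_input[x < 0] = 0            # Jacobian of relu is diagonal: 1 where x > 0, 0 where x < 0
        return grad_input                # dL/dx, same shape as x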


Thank you so much for the detailed responses. :slight_smile:

So, my approach to calculating relu’s gradient was flawed, and an alternative but inefficient way of calculating grad_h would be:

    grad_h = grad_h_relu.copy()
    grad_h = grad_h_relu.dot(Jacobian(h_relu))

right?
Also, in this example, we are processing a batch of 64 examples together. h is of size [64, 100] and so is h_relu. So how do we calculate the Jacobian of matrices?

an alternative but inefficient way of calculating grad_h would be:

Yes exactly.

Also, in this example, we are processing a batch of 64 examples together. h is of size [64, 100] and so is h_relu. So how do we calculate the Jacobian of matrices?

From the autograd point of view, the “batch” does not exist, so I usually find it easier to just consider it a 2D → 2D function.
If you want to reuse the logic above, you need to flatten the input/output to make them 1D → 1D, and compute the corresponding Jacobian. Since samples in the batch don’t interact with each other, the full Jacobian will be a block-diagonal matrix, where each block is the Jacobian you defined for a single sample.


Okay, so we can flatten a matrix to make it a vector and then compute the Jacobian of that, but I’m not very sure how this big Jacobian will fit into this equation:

    grad_h = grad_h_relu.dot(Jacobian(h_relu))

Here, grad_h and grad_h_relu are of size [64, 100]. Since h_relu is of size [64, 100] and relu is a pointwise operation, after flattening we get a vector of size 6400 whose Jacobian is a square matrix of size [6400, 6400].
This combined Jacobian of all the samples cannot be multiplied with grad_h_relu directly, so I am not sure how the two matrices on the RHS would be multiplied.
One thing I can think of is: flatten grad_h_relu to make it a vector of size 6400, multiply it with the [6400, 6400] Jacobian to get a vector of size 6400, and then unflatten that back to size [64, 100]. Is this the way to do it?

I don’t know if this forum is the right place to discuss this, as the discussion is not specific to PyTorch, and I apologize for bringing it up here if it isn’t. The doubt came up while going through one of the PyTorch tutorials on the site, and I thought many people using this forum would have gone through that tutorial, so I asked it here :thinking:

Since you now consider a 1D → 1D function, you need to make sure to view your inputs the right way before/after applying the formula.
So here you want to flatten grad_h_relu before the dot product and then view the 1D result as the 2D Tensor you want.
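
For what it’s worth, here is a small numpy sketch of that flatten → dot → unflatten procedure (I use a 3x4 batch instead of 64x100 so the Jacobian stays small; the variable names are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    h = rng.standard_normal((3, 4))            # pre-relu activations (stand-in for the [64, 100] h)
    grad_h_relu = rng.standard_normal((3, 4))  # gradient flowing back, dL/dh_relu

    # Jacobian of the flattened relu: block-diagonal (here simply diagonal),
    # with 1 where h > 0 and 0 elsewhere.
    J = np.diag((h.flatten() > 0).astype(h.dtype))   # shape [12, 12]

    # Flatten, do the matrix product with the Jacobian, then view the result as [3, 4] again.
    grad_h_via_jacobian = grad_h_relu.flatten().dot(J).reshape(h.shape)

    # The tutorial's shortcut: copy the incoming gradient and zero out entries where h < 0.
    grad_h_tutorial = grad_h_relu.copy()
    grad_h_tutorial[h < 0] = 0

    print(np.allclose(grad_h_via_jacobian, grad_h_tutorial))  # True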

I don’t know if this forum is the right place to discuss this

It definitely is :smiley:
