Scale gradient without scaling loss values

Hi everyone,

I recently came across a coding pattern for scaling gradient values without scaling the loss value, but I am confused about it. I implemented a simple example myself below.

import torch

x = torch.randn(2, 3, requires_grad=True)
y = 5

z = 2 * x
z_backward = z * y
z_new = z.detach() + (z_backward - z_backward.detach())
z_new2 = z + (z_backward - z_backward.detach())

print(f'x input: {x}')
print(f'x gradient: {x.grad}')
print(f'z: {z}')
print(f'z_new: {z_new}')

# Each backward pass uses .mean() as the loss; x.grad is reset in between
# so the gradients are not accumulated.
z.mean().backward(retain_graph=True)
print(f'x gradient from z :{x.grad}')
x.grad = None

z_new.mean().backward(retain_graph=True)
print(f'x gradient from z_new :{x.grad}')
x.grad = None

z_backward.mean().backward(retain_graph=True)
print(f'x gradient from z_backward :{x.grad}')
x.grad = None

z_new2.mean().backward()
print(f'x gradient from z_new2 :{x.grad}')

The corresponding outputs are shown below:

x input: tensor([[ 1.5410, -0.2934, -2.1788],
        [ 0.5684, -1.0845, -1.3986]], requires_grad=True)
x gradient: None
z: tensor([[ 3.0820, -0.5869, -4.3576],
        [ 1.1369, -2.1690, -2.7972]], grad_fn=<MulBackward0>)
z_new: tensor([[ 3.0820, -0.5869, -4.3576],
        [ 1.1369, -2.1690, -2.7972]], grad_fn=<AddBackward0>)
x gradient from z :tensor([[0.3333, 0.3333, 0.3333],
        [0.3333, 0.3333, 0.3333]])
x gradient from z_new :tensor([[1.6667, 1.6667, 1.6667],
        [1.6667, 1.6667, 1.6667]])
x gradient from z_backward :tensor([[1.6667, 1.6667, 1.6667],
        [1.6667, 1.6667, 1.6667]])
x gradient from z_new2 :tensor([[2., 2., 2.],
        [2., 2., 2.]])

I can understand that the gradient from z is 0.3333, which is 2 divided by 6 (the number of elements). It is also easy to understand that the gradient from z_backward is additionally scaled by 5.

However, what confuses me is the gradients from z_new and z_new2. z, z_new, and z_new2 produce the same forward values but different gradient values with respect to x.

Could anyone explain how the gradients from z_new and z_new2 are generated? Also, the gradient values from z_backward and z_new are the same. What is the difference between these two?

Thanks for any suggestion in advance.

z_new uses z_backward to calculate the gradients, as it's the only tensor attached to the computation graph; the other two tensors were explicitly detached.
z_new2 uses both z and z_backward to calculate the gradients, as both are attached to the computation graph.
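A minimal sketch of this (using a hypothetical all-ones input so the expected numbers are easy to verify by hand):

```python
import torch

# All-ones input so the gradient values are easy to check by hand
x = torch.ones(2, 3, requires_grad=True)
z = 2 * x            # dz/dx = 2
z_backward = z * 5   # dz_backward/dx = 10
z_new2 = z + (z_backward - z_backward.detach())

z_new2.mean().backward()
# Both attached paths contribute: (2 + 10) / 6 elements = 2.0
print(x.grad)
```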

That’s expected, since z_new can only use z_backward during its backward pass; the other two tensors are detached and thus constants.

Nothing when it comes to the gradient, but their forward passes differ, since z_new returns the value of z as its forward output activation.
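To make that concrete, here is a small sketch (assuming an all-ones input, as a hypothetical setup) showing that z_new and z_backward produce identical gradients while their forward values differ:

```python
import torch

x = torch.ones(2, 3, requires_grad=True)
z = 2 * x
z_backward = z * 5
z_new = z.detach() + (z_backward - z_backward.detach())

# Forward values: z_new equals z, while z_backward is 5 times larger
print(torch.equal(z_new, z))  # True

# Gradients: both backward passes flow only through z_backward
z_new.mean().backward(retain_graph=True)
grad_from_z_new = x.grad.clone()
x.grad = None
z_backward.mean().backward()
print(torch.equal(grad_from_z_new, x.grad))  # True
```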


Thanks for your quick response. Now I understand where the gradients from z_new and z_new2 come from. Is it better to just write something like z_new2 = z + z_backward?

I am also still confused about some points:

I still cannot understand why one should do something like (z_backward - z_backward.detach()).
Why not just do z_new = z.detach() + z_backward, or simply call z_backward.backward() itself?
Also, is there any advantage or consideration in not calling z_backward.backward() directly? My guess is that this trick can scale the gradient value without scaling the output z_new (the loss); maybe that makes it easier to adjust the learning rate?

The advantage of

z_new = z.detach() + (z_backward - z_backward.detach())

is that you will get z as the forward output and will use z_backward to calculate the gradients in the backward pass.
In the forward pass z_backward is cancelled out by the subtraction, so only z.detach() defines the value. During the backward pass, however, Autograd will only use z_backward, since it's the only attached tensor.
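The pattern can be wrapped into a small helper. This is just an illustrative sketch; scale_grad is a hypothetical name, not a PyTorch API:

```python
import torch

def scale_grad(t, scale):
    # Hypothetical helper: forward value equals t, backward gradient
    # is multiplied by `scale`.
    scaled = scale * t
    # (scaled - scaled.detach()) is zero in the forward pass, so only
    # t.detach() defines the value; in the backward pass only `scaled`
    # is attached to the graph, so the gradient is scaled.
    return t.detach() + (scaled - scaled.detach())

x = torch.ones(3, requires_grad=True)
out = scale_grad(2 * x, 5)
print(out)        # same forward values as 2 * x
out.sum().backward()
print(x.grad)     # tensor([10., 10., 10.]) instead of tensor([2., 2., 2.])
```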


Thanks for your clear explanation!!! I think I understand it clearly now 🙂