# Scale gradient without scaling loss values

Hi everyone,

I recently came across a coding pattern for scaling gradient values without scaling the loss value, but I am confused about it. I implemented a simple example myself below.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3, requires_grad=True)
y = 5  # gradient scaling factor

z = 2 * x
z_backward = z * y
# forward value of z, gradient of z_backward
z_new = z.detach() + (z_backward - z_backward.detach())
# forward value of z, gradient of z + z_backward
z_new2 = z + (z_backward - z_backward.detach())

print('------------------------')
print(f'x input: {x}')
print('------------------------')
print(f'z: {z}')
print(f'z_new: {z_new}')
print('------------------------')
z.mean().backward(retain_graph=True)
print(f'x gradient from z :{x.grad}')
x.grad = None
z_new.mean().backward(retain_graph=True)
print(f'x gradient from z_new :{x.grad}')
x.grad = None
z_backward.mean().backward(retain_graph=True)
print(f'x gradient from z_backward :{x.grad}')
x.grad = None
z_new2.mean().backward(retain_graph=True)
print(f'x gradient from z_new2 :{x.grad}')
print('------------------------')
```

The corresponding output is shown below.

```
------------------------
x input: tensor([[ 1.5410, -0.2934, -2.1788],
------------------------
z: tensor([[ 3.0820, -0.5869, -4.3576],
z_new: tensor([[ 3.0820, -0.5869, -4.3576],
------------------------
x gradient from z :tensor([[0.3333, 0.3333, 0.3333],
[0.3333, 0.3333, 0.3333]])
x gradient from z_new :tensor([[1.6667, 1.6667, 1.6667],
[1.6667, 1.6667, 1.6667]])
x gradient from z_backward :tensor([[1.6667, 1.6667, 1.6667],
[1.6667, 1.6667, 1.6667]])
x gradient from z_new2 :tensor([[2., 2., 2.],
[2., 2., 2.]])
------------------------
```

I can understand that the gradient from `z` is 0.3333, which is 2 divided by 6. It is also easy to understand that the gradient from `z_backward` is scaled by 5.
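
As a quick sanity check of those two numbers, here is a minimal sketch using `torch.autograd.grad` (assuming the same `x = torch.randn(2, 3, requires_grad=True)` as in the snippet above):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3, requires_grad=True)
z = 2 * x
z_backward = z * 5

# d(mean(2 * x)) / dx = 2 / 6 = 0.3333 per element
print(torch.autograd.grad(z.mean(), x, retain_graph=True))
# d(mean(10 * x)) / dx = 10 / 6 = 1.6667 per element
print(torch.autograd.grad(z_backward.mean(), x))
```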

However, what I am confused about are the gradients from `z_new` and `z_new2`. The outputs of `z`, `z_new` and `z_new2` are the same values, but the gradients with respect to `x` differ.

Could anyone explain how the values from `z_new` and `z_new2` are generated? Also, the gradient values from `z_backward` and `z_new` are the same. What is the difference between these two?

Thanks in advance for any suggestions.

`z_new` uses `z_backward` to calculate the gradients, as it's the only tensor attached to the computation graph; the other two tensors were explicitly detached.
`z_new2` uses `z` and `z_backward` to calculate the gradients, as both are attached to the computation graph.

The matching gradients are expected, since `z_new` can only use `z_backward` during its backward pass; the other two tensors are detached and thus constants.

As for the difference between the two: nothing when it comes to the gradient, but their forward passes differ, since `z_new` returns `z` as its forward output activation while `z_backward` returns the scaled value.
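
To make the pattern reusable, here is a small sketch of a helper (my own naming, not from this thread) that keeps the forward value of a tensor but scales its gradient:

```python
import torch

def scale_grad(t: torch.Tensor, scale: float) -> torch.Tensor:
    """Return a tensor with the same forward value as `t`, but whose
    gradient is multiplied by `scale` in the backward pass."""
    scaled = t * scale
    # Forward: t.detach() + (scaled - scaled.detach()) equals t numerically.
    # Backward: only `scaled` stays attached to the graph, so the gradient
    # flowing back into `t` is multiplied by `scale`.
    return t.detach() + (scaled - scaled.detach())

x = torch.randn(2, 3, requires_grad=True)
out = scale_grad(2 * x, 5.0)
out.mean().backward()
print(out)     # same values as 2 * x
print(x.grad)  # 5 * 2 / 6 = 1.6667 per element
```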


Thanks for your quick response. Now I understand where the gradients from `z_new` and `z_new2` come from. Is it better to just write something like `z_new2 = z + z_backward`?

I am also still confused about some points:

I still cannot understand why one should do something like `(z_backward - z_backward.detach())`. Why not just write `z_new = z.detach() + z_backward`, or simply call `z_backward.backward()` itself?
Also, is there any advantage or consideration in not calling `z_backward.backward()` directly? My guess is that the trick can scale the gradient value without scaling the output `z_new` (the loss); maybe that is better than adjusting the learning rate?

The idea behind

```python
z_new = z.detach() + (z_backward - z_backward.detach())
```

is that you will get `z` as the forward output and will use `z_backward` to calculate the gradients in the backward pass.
In the forward pass `z_backward` is cancelled out by the subtraction, so only `z.detach()` defines the value. However, during the backward pass Autograd will only use `z_backward`, since it's the only tensor attached to the computation graph.
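
To make the cancellation concrete, here is a minimal sketch (my own, along the lines of the explanation above) comparing the full expression with a plain `z.detach() + z_backward`:

```python
import torch

x = torch.randn(2, 3, requires_grad=True)
z = 2 * x
z_backward = z * 5

with_trick = z.detach() + (z_backward - z_backward.detach())
plain_sum = z.detach() + z_backward  # no cancellation: forward value becomes 6 * z

print(torch.allclose(with_trick, z))     # True: forward value stays equal to z
print(torch.allclose(plain_sum, 6 * z))  # True: forward value is scaled as well

with_trick.mean().backward()
print(x.grad)  # 10 / 6 = 1.6667 per element: the gradient comes only from z_backward
```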

Thanks for your clear explanation! I think I understand it clearly now.