Hi everyone,
I recently came across a code pattern for scaling gradient values without scaling the loss value, but I am confused about how it works. I implemented a simple example myself, shown below.
import torch

torch.manual_seed(0)
x = torch.randn(2, 3, requires_grad=True)
y = 5  # scale factor applied to the gradients
z = 2 * x
z_backward = z * y
# z_new has the same forward value as z (the bracketed term is zero in the forward pass)
z_new = z.detach() + (z_backward - z_backward.detach())
# z_new2 is the same construction, but without detaching the first z
z_new2 = z + (z_backward - z_backward.detach())
print('------------------------')
print(f'x input: {x}')
print(f'x gradient: {x.grad}')
print('------------------------')
print(f'z: {z}')
print(f'z_new: {z_new}')
print('------------------------')
z.mean().backward(retain_graph=True)
print(f'x gradient from z :{x.grad}')
x.grad.zero_()
z_new.mean().backward(retain_graph=True)
print(f'x gradient from z_new :{x.grad}')
x.grad.zero_()
z_backward.mean().backward(retain_graph=True)
print(f'x gradient from z_backward :{x.grad}')
x.grad.zero_()
z_new2.mean().backward(retain_graph=True)
print(f'x gradient from z_new2 :{x.grad}')
print('------------------------')
The corresponding output is shown below:
------------------------
x input: tensor([[ 1.5410, -0.2934, -2.1788],
[ 0.5684, -1.0845, -1.3986]], requires_grad=True)
x gradient: None
------------------------
z: tensor([[ 3.0820, -0.5869, -4.3576],
[ 1.1369, -2.1690, -2.7972]], grad_fn=<MulBackward0>)
z_new: tensor([[ 3.0820, -0.5869, -4.3576],
[ 1.1369, -2.1690, -2.7972]], grad_fn=<AddBackward0>)
------------------------
x gradient from z :tensor([[0.3333, 0.3333, 0.3333],
[0.3333, 0.3333, 0.3333]])
x gradient from z_new :tensor([[1.6667, 1.6667, 1.6667],
[1.6667, 1.6667, 1.6667]])
x gradient from z_backward :tensor([[1.6667, 1.6667, 1.6667],
[1.6667, 1.6667, 1.6667]])
x gradient from z_new2 :tensor([[2., 2., 2.],
[2., 2., 2.]])
------------------------
I can understand that the gradient from z is 0.3333, which is 2 divided by 6: the factor of 2 comes from z = 2 * x, and the 1/6 comes from taking mean() over six elements. It is also easy to understand that the gradient from z_backward is that value scaled by 5, giving 1.6667; the arithmetic is spelled out in the sketch below.
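To spell out that arithmetic, here is a small hand-check I put together (the expected_* names are my own, not from the snippet above):

import torch

# mean() over the 6 elements contributes 1/6 per element, and z = 2 * x
# contributes a factor of 2, so d(z.mean())/dx = 2/6 = 0.3333 everywhere.
expected_z_grad = torch.full((2, 3), 2.0 / 6.0)
# z_backward = z * y multiplies each gradient by y = 5, so 10/6 = 1.6667.
expected_z_backward_grad = expected_z_grad * 5
print(expected_z_grad)
print(expected_z_backward_grad)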
However, what I am confused about are the gradients from z_new and z_new2. The outputs of z, z_new, and z_new2 all contain the same values, yet they produce different gradients with respect to x. Could anyone explain how the gradient values from z_new and z_new2 are generated? Also, the gradient values from z_backward and z_new are the same; what is the difference between these two?
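In case it is useful context, the three tensors really are built differently even though their values match, which can be seen from their grad_fn attributes (continuing from my snippet above):

print(z.grad_fn)       # <MulBackward0> - built directly from 2 * x
print(z_new.grad_fn)   # <AddBackward0> - detached z plus the zero-valued term
print(z_new2.grad_fn)  # <AddBackward0> - z plus the zero-valued term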
Thanks in advance for any suggestions.