How to discern the computational graph and the corresponding gradient calculation process after I modify the value of tensors with ".data"

I have been told that modifying the value of a tensor with .data is dangerous, since it can produce wrong gradients when backward() is called. However, I wrote the two programs below, and the resulting gradients are so strange that I cannot figure out how they are calculated.

The first program is like this:

import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()
c = out.data          # shares storage with out, but bypasses autograd tracking
c[0] = 1
c[1] = 3
c[2] = 4
weight = torch.ones(out.size())
d = torch.autograd.grad(out, a, weight, retain_graph=True)[0]

d is tensor([ 0., -6., -12.])

while the second one is:

import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()
c = out.data          # shares storage with out, but bypasses autograd tracking
c.zero_()
weight = torch.ones(out.size())
d = torch.autograd.grad(out, a, weight, retain_graph=True)[0]

d is tensor([0., 0., 0.])

I know that vector-to-vector derivatives are a little complex. However, since the value of a and the computational flow from a to out are identical in the two programs, the value of d should not differ between them.

I sincerely appreciate all your help and suggestions. The question can also be found at https://stackoverflow.com/questions/76021517/how-to-discern-the-computational-graph-and-the-corresponding-gradient-calculatio

This is a good example of how .data can cause silent correctness issues.

The derivative of sigmoid(x) is sigmoid(x) * (1 - sigmoid(x)). The autograd engine will probably implement this by taking the tensor you got from running sigmoid(x) in the forward pass and saving it directly, to be reused in the backward computation.
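One way to see this concretely is to inspect what autograd has saved for the backward pass. This is a minimal sketch; it assumes a recent PyTorch version that exposes the saved tensor of SigmoidBackward0 through the _saved_result attribute:

import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()

# sigmoid saves its own output for the backward pass, since
# d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)) only needs the output
print(out.grad_fn)                # <SigmoidBackward0 object at ...>
print(out.grad_fn._saved_result)  # tensor([0.7311, 0.8808, 0.9526])

# The saved tensor shares storage with out, so a mutation through .data
# is visible to the backward computation even though autograd never records it
out.data[0] = 1.0
print(out.grad_fn._saved_result)  # tensor([1.0000, 0.8808, 0.9526])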

Since you are manually modifying the output of the sigmoid call in a way that is hidden from autograd (through the .data field), autograd will end up using your mutated "out" tensor in the backward pass, giving you arbitrary gradients. In the second example, since you mutated out.data to set it to zero, you get "0 * (1 - 0)" = 0 for every element.
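To check this explanation numerically, here is a small sketch that replays the first program and recomputes the gradient by hand from the mutated values:

import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()
c = out.data
c[0], c[1], c[2] = 1., 3., 4.    # same mutation as in the first program

weight = torch.ones(out.size())
d = torch.autograd.grad(out, a, weight, retain_graph=True)[0]
print(d)                         # tensor([  0.,  -6., -12.])

# Manual backward using the mutated output, which is what autograd ends up doing:
mutated = torch.tensor([1., 3., 4.])
print(weight * mutated * (1 - mutated))   # tensor([  0.,  -6., -12.])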


Thank you for your help!
Please allow me to paraphrase your answer in my own words. For example, with z = f(x): when computing dz/dx, PyTorch doesn't always literally evaluate the formula for f'(x) and then plug in the actual value of x. Sometimes (as it does for sigmoid()), PyTorch gets the value of dz/dx directly from the value of z.

Yep! (It depends on the particular operator; not all operators use z in the computation of dz/dx.)
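For contrast, here is a small sketch with an operator whose backward uses the saved input rather than the output; it assumes torch.sin computes its backward as grad * cos(input), in which case mutating out.data has no effect on the gradient:

import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sin()

# sin's backward is grad * cos(a): it reuses the saved input, not the output,
# so corrupting the output through .data leaves the gradient untouched
out.data.zero_()

weight = torch.ones(out.size())
d = torch.autograd.grad(out, a, weight)[0]
print(d)         # tensor([ 0.5403, -0.4161, -0.9900])
print(a.cos())   # same values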
