Manually re-connect a broken forward path in y=wx operation

Hello, my work on customizing the linear multiplication y = wx in a neural network requires manipulating the shapes of the w and x matrices. For example, w and x originally have the shapes m×k and k×n respectively; now I need to temporarily expand them to m×2k and 2k×n to do some processing. Changing w and x directly doesn’t seem to be a good idea, since it would change the parameters to be updated.

Therefore I created temporary tensors w1 and x1 with the expanded shapes and values, and computed y from these two, i.e.

y = w1x1, where w1 has the shape m×2k and x1 has the shape 2k×n

However, the neural network parameters do not get updated properly by doing this. My guess is this customization broke the computational graph (i.e. y is not properly linked to w and x). Is there a way to set any properties of y to fix this broken path?
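
To make the setup concrete, here is a minimal sketch with hypothetical sizes m, k, n (the custom expansion itself is omitted):

import torch

m, k, n = 4, 3, 5                            # hypothetical sizes
w = torch.randn(m, k, requires_grad=True)    # the parameter that should be updated
x = torch.randn(k, n)                        # the input

# w1/x1 are built from w/x by some custom processing (details omitted)
w1 = torch.zeros(m, 2 * k)
x1 = torch.zeros(2 * k, n)

y = w1.mm(x1)   # y only depends on w1/x1 in the graph, so no gradient reaches w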


I think to get a useful answer, you’d have to say a bit more about what you want the relationship between w1/x1 and w/x to be (e.g. do you want w1 to have two blocks with the entries of w, or something completely different, or so).

Best regards

Thomas


Hi, thanks for your reply. One example of the relationship between w1, x1 and w, x could be to round each element of w and x to an integer first, then expand them into a bit representation. For instance, if x = [[1,0], [1,2]], then x1 would be [[0,0], [1,0], [0,1], [1,0]]. The result y would then be w1x1 plus some post-processing.

In this case, y is not directly computed from wx, and hence backpropagation would not flow properly. I wonder if there is a way to tell autograd to treat y as if it were computed from wx?
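
For reference, here is one rough sketch of how x1 could be built from x under these assumptions (2-bit values, most significant bit first, expanding along the k dimension):

import torch

x = torch.tensor([[1., 0.],
                  [1., 2.]])

# round to integers and split every entry into 2 bits (MSB first),
# expanding along the row (k) dimension: (k, n) -> (2k, n)
xi = x.round().to(torch.int64)
msb = (xi >> 1) & 1
lsb = xi & 1
x1 = torch.stack((msb, lsb), dim=1).reshape(2 * x.size(0), x.size(1)).float()
print(x1)
> tensor([[0., 0.],
          [1., 0.],
          [0., 1.],
          [1., 0.]])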


One trick that often helps for “pretend it has been calculated with x even when I used x1” is to use y = wx + (w1x1 - wx).detach_(): in the forward pass wx cancels out, so y = w1x1, but in the backward pass the detach_ causes gradients to flow only through wx.
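
A minimal sketch of this (w1/x1 below are just stand-ins for whatever expanded tensors you build):

import torch

w = torch.randn(4, 3, requires_grad=True)    # m x k parameter
x = torch.randn(3, 5)                        # k x n input

# stand-ins for the expanded tensors (build these however you need)
w1 = torch.cat((w, w), dim=1).detach()       # m x 2k
x1 = torch.cat((x, x), dim=0)                # 2k x n

y = w.mm(x) + (w1.mm(x1) - w.mm(x)).detach()
# forward:  y has the value of w1.mm(x1)
# backward: gradients flow only through the first w.mm(x) term, i.e. to w
y.sum().backward()
print(w.grad.shape)
> torch.Size([4, 3])
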
Would that work for you?

Best regards

Thomas


Hi, I’ve been investigating this issue these days. The approach you provided seems to work properly. Thank you so much!!!

However, I also tried something like this:

y = wx
y1 = w1x1
y.data = y1.data

From my understanding, this should also work. But it actually led to a different result from your approach. Do you know why that is?


Autograd does not trace this operation (manually changing the .data attribute), hence the different result.


Hi, thanks for your reply. I still don’t get what you mean by “not trace this operation”. I tried to create a simple network with only 2 parameters, w1 and w2. The middle layer is given by y1 = w1x, and the output layer is given by y2 = w2y1. Now if I simply do

y1 = w1x
temp = torch.tensor([1.0, 2.0])
y1.data = temp.data

The backward gradient seems to get calculated properly. How and in which cases does .data cause problems?


As @InnovArul said, the manipulation of .data won’t be traced and thus might lead to a wrong result.
Here is a simple example demonstrating this issue.
In the first part of the code, we just calculate the loss for our operations and apply the gradient to w.
We expect values of [[17.], [17.]] after the update.
In the second example, I’ve manipulated the underlying data of w after the gradient calculation.
Autograd did not trace this manipulation, and the gradients for the original w are now applied to the manipulated w.


# Standard approach
import torch

x = torch.ones(1, 2)
w = torch.ones(2, 1, requires_grad=True)
target = torch.full((1,), 10.)
optimizer = torch.optim.SGD([w], lr=1.)

output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
> tensor([[-16.],
          [-16.]])
optimizer.step()
print(w)
> tensor([[ 17.],
          [ 17.]])

# Now manipulate the underlying data
w = torch.ones(2, 1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=1.)

output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
> tensor([[-16.],
          [-16.]])
w.data = torch.full((2, 1), -100.)
optimizer.step()
print(w)
> tensor([[-84.],
          [-84.]])

Oh, I see. So in my example above, the backward computation will use the manipulated value of y1 (i.e. temp) instead of the original one (i.e. w1x), right?

But for this code below:

# method 1
y = wx
temp = w1x1
y.data = temp.data

# method 2
y = wx
temp = w1x1
y += (temp - y).detach_()

I still think they are equivalent, since during the backward computation, the value of y will be changed to temp in both cases. Please correct me if I’m wrong.


Of course, there are many ways to make the operation equivalent in calculating gradients. It’s just not advised to access .data if you want to completely depend on autograd for the accuracy of your gradients.

From your question at the top, I feel that your use case would be achievable by using the tensor’s .repeat() function. Can you have a look?

for example, you can say

y = w.repeat(1,2) * x.repeat(1,2)

Hi, thank you for the help!! This could work, but instead of this one-line code, can I split it into separate lines?

w1 = w.repeat(1,2)
x1 = x.repeat(1,2)
y = w1*x1

Since w has .requires_grad set to True, the PyTorch documentation indicates w1 will also have .requires_grad set to True. Will this change the network topology or increase the number of parameters?


You can absolutely do this, and it will not increase the number of params, i.e. w is still the only param in your code (assuming x is data).
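
For example (a rough sketch, using a matmul-friendly combination of repeats, which may differ from your exact expansion):

import torch

x = torch.ones(1, 2)
w = torch.ones(2, 1, requires_grad=True)

x1 = x.repeat(1, 2)   # 1 x 4
w1 = w.repeat(2, 1)   # 4 x 1, inherits requires_grad, but it is not a new parameter

y = x1.mm(w1)
y.sum().backward()
print(w.grad)   # gradients are accumulated back into w (each entry of w appears twice in w1)
> tensor([[2.],
          [2.]])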


I would generally recommend against using data at all.
There are two parts:

  • If you want to break the graph at a point, use .detach() instead.
  • If you want a calculation without autograd, use with torch.no_grad():.

Using .data is a bit like using those two, but except in very special situations (e.g. the optimizer’s internal updates) there isn’t a good reason to use it, unless you like headaches (then it is much cheaper than too much beer at the Oktoberfest, though).
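
A tiny sketch of the two (not tied to the example above):

import torch

x = torch.ones(3, requires_grad=True)

# 1) break the graph at a point: the result has no gradient history
y = (x * 2).detach()
print(y.requires_grad)
> False

# 2) run a calculation without autograd recording anything
with torch.no_grad():
    z = x * 2
print(z.requires_grad)
> False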

Best regards

Thomas


Thank you all for the help!!


Hi, these days I went back to this problem for some reason, and it seems autograd did trace the data attribute. Here is an example based on yours:

import torch

x = torch.ones(1, 2)
w = torch.ones(2, 1, requires_grad=True)
target = torch.full((1,), 10.)

# manipulate data attribute
x.data = torch.cat((x,x), dim=1)
w.data = torch.cat((w,w), dim=0)

output = x.mm(w)
optimizer = torch.optim.SGD([w], lr=1.)

loss = (output - target)**2
loss.backward()
print(w.grad)
> tensor([[-12.],
          [-12.],
          [-12.],
          [-12.]])

If autograd did not trace the data attribute, shouldn’t w.grad have a size of 2x1 instead of 4x1? Please correct me if I’m wrong. Thx!


Thanks for the example!
The manipulation of the .data attribute works in your example because you are manipulating it before performing any computation.
I would still discourage its usage, as it might still break if you manipulate the data after part of the computation graph has already been created.
