# Manually re-connect a broken forward path in y=wx operation

Hello, my work to customize the linear multiplication y = wx in a neural network requires manipulating the shapes of the w and x matrices. For example, previously w and x had the shapes `mxk` and `kxn` respectively; now I need to temporarily expand them to `mx2k` and `2kxn` to do some processing. Making changes on w and x directly doesn’t seem to be a good idea, since it would change the parameters that get updated.

Therefore I created temporary tensors w1 and x1 to have the expanded shape and values, and compute y using these two. i.e.

``````
y = w1x1, where w1 has the shape mx2k and x1 has the shape 2kxn
``````

However, the neural network parameters do not get updated properly by doing this. My guess is that this customization broke the computational graph (i.e. y is not properly linked to w and x). Is there a way to set some property of y to fix this broken path?


I think to get a useful answer, you’d have to say a bit more about what you want the relationship between w1/x1 and w/x to be (e.g. do you want w1 to have two blocks with the entries of w, or something completely different).

Best regards

Thomas


Hi, thanks for your reply. One example of the relationship between w1, x1 and w, x could be to round each element of w and x to an integer first, then expand them to a bit representation. For instance, if `x = [[1,0], [1,2]]`, then `x1` would be `[[0,0], [1,0], [0,1], [1,0]]`. The result y would then be `w1x1` plus some post-processing.
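
For illustration, a minimal sketch of such a "round, then expand to bits" mapping, built as a separate tensor so `w` and `x` themselves stay untouched. The helper name, the number of bits, and the bit ordering are assumptions for the example and may not match the exact layout above:

``````
import torch

def expand_to_bits(t, n_bits=2):
    # round each element to a non-negative integer
    ti = t.round().clamp(min=0).long()
    # extract one bit plane per bit position (least-significant bit first, assumed ordering)
    bit_planes = [((ti >> b) & 1).float() for b in range(n_bits)]
    # stack the bit planes along dim 0: a (k, n) tensor becomes (n_bits*k, n)
    return torch.cat(bit_planes, dim=0)

x = torch.tensor([[1., 0.], [1., 2.]])
x1 = expand_to_bits(x)   # shape (4, 2)
``````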

In this case, `y` is not directly computed from `wx`, and hence backpropagation would not flow properly. I wonder if there is a way to tell the neural network to treat `y` as computed from `wx`?


One trick that often helps for “pretend it has been calculated with w and x even when I used w1 and x1” is to use `y = wx + (w1x1 - wx).detach_()`: in the forward pass the wx terms cancel out, so y = w1x1, but in the backward pass the detach_ causes gradients to flow only through wx.
Would that work for you?
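
In case a concrete sketch helps, here is a toy version of that trick. The shapes and the expansion step are made up for illustration (the `torch.cat` below is just a stand-in for whatever non-differentiable processing produces w1 and x1):

``````
import torch

m, k, n = 2, 3, 4
w = torch.randn(m, k, requires_grad=True)   # the real parameter
x = torch.randn(k, n)                        # data

with torch.no_grad():
    # placeholder for a non-differentiable expansion (e.g. rounding / bit tricks)
    w1 = torch.cat([w, w], dim=1)            # (m, 2k)
    x1 = torch.cat([x, x], dim=0)            # (2k, n)

wx = w.mm(x)
y = wx + (w1.mm(x1) - wx).detach()           # forward value equals w1 @ x1, gradients flow through wx

y.sum().backward()
print(w.grad.shape)                          # torch.Size([2, 3]) -- w still receives gradients
``````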

Best regards

Thomas


Hi, I’ve been investigating this issue these days. The approach you provided seems to work properly. Thank you so much!!!

However, I also tried something like this:

``````
y = wx
y1 = w1x1
y.data = y1.data
``````

From my understanding, this should also work. But it actually led to a different result from your approach. Do you know why that is?


Autograd does not trace this operation (manually changing the `data` attribute), hence the different result.


Hi, thanks for your reply. I still don’t get what you mean by “not trace this operation”. I tried to create a simple network with only 2 parameters, w1 and w2. The middle layer is given by y1 = w1x, and the output layer is given by y2 = w2y1. Now if I simply do

``````
y1 = w1x
temp = torch.tensor([1.0, 2.0])
y1.data = temp.data
``````

The backward gradient seems to get calculated properly. How and in which cases does `.data` cause problems?


As @InnovArul said, the manipulation of `data` won’t be traced and thus might lead to a wrong result.
Here is a simple example demonstrating this issue.
In the first part of the code, we just calculate the loss for our operations and apply the gradient on `w`.
We expect values of `[[17.], [17.]]` after the update.
In the second example, I’ve manipulated the underlying data of `w` after the gradient calculation.
Autograd did not trace this manipulation and the gradients for the original `w` are now applied on the manipulated `w`.

``````
# Standard approach
import torch

w = torch.ones(2, 1, requires_grad=True)  # parameter (initialization assumed so the numbers below match)
x = torch.ones(1, 2)
target = torch.full((1,), 10.)
optimizer = torch.optim.SGD([w], lr=1.)

output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
# > tensor([[-16.],
#           [-16.]])
optimizer.step()
print(w)
# > tensor([[ 17.],
#           [ 17.]])

# Now manipulate the underlying data
w = torch.ones(2, 1, requires_grad=True)  # re-initialized (assumed) so both runs start from the same w
optimizer = torch.optim.SGD([w], lr=1.)

output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
# > tensor([[-16.],
#           [-16.]])
w.data = torch.full((2, 1), -100.)
optimizer.step()
print(w)
# > tensor([[-84.],
#           [-84.]])
``````

Oh, I see. So in my example above, the backward computation will use the manipulated value of y1 (i.e. temp) instead of the original one (i.e. w1x), right?

But for this code below:

``````
# method 1
y = wx
temp = w1x1
y.data = temp.data

# method 2
y = wx
temp = w1x1
y += (temp - y).detach_()
``````

I still think they are equivalent, since during the backward computation, the value of `y` will be changed to `temp` in both cases. Please correct me if I’m wrong.


Of course, there are many ways to make the operation equivalent in calculating gradients. It’s just not advised to access `.data` if you want to completely depend on autograd for the accuracy of your gradients.

From your question at the top, I feel that your use case would be achievable by using the tensor’s `.repeat()` function. Can you have a look?

For example, you can say

``````
y = w.repeat(1,2) * x.repeat(1,2)
``````

Hi, thank you for the help!! This could work, but instead of this one-line code, can I split it into different lines?

``````
w1 = w.repeat(1,2)
x1 = x.repeat(1,2)
y = w1*x1
``````

Since `w` has `.requires_grad` set to `True`, the PyTorch documentation indicates `w1` will also have `.requires_grad` set to `True`. Will this change the network topology or increase the number of parameters?


You can absolutely do this and it will not increase the params, i.e., `w` is the only param in your code (assuming `x` is data).
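
For what it’s worth, a small sketch (with made-up shapes) that checks this: `w` stays the only leaf/parameter, `w1` is just an intermediate node, and the gradient flowing back through `.repeat()` keeps `w`’s original shape:

``````
import torch

w = torch.randn(3, 2, requires_grad=True)   # the only parameter (a leaf tensor)
x = torch.randn(3, 2)                        # data, no grad

w1 = w.repeat(1, 2)                          # requires_grad=True, but a non-leaf intermediate
x1 = x.repeat(1, 2)
y = w1 * x1

y.sum().backward()
print(w.is_leaf, w1.is_leaf)                 # True False
print(w.grad.shape)                          # torch.Size([3, 2]) -- gradient summed back to w's shape
``````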


I would generally recommend against using `.data` at all.
There are two parts:

• If you want to break the graph at a point, use `.detach()` instead.
• If you want a calculation without autograd, use `with torch.no_grad():`.

Using `.data` is a bit like combining those two, but except in very special situations (e.g. the optimizer updating internally) there isn’t a good reason to use it, unless you like headaches (then it is much cheaper than too much beer at the Oktoberfest, though).
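
For reference, a minimal sketch of those two alternatives on a toy tensor:

``````
import torch

a = torch.randn(3, requires_grad=True)

# 1) Break the graph at a point: b shares data with a, but gradients stop here.
b = a.detach()

# 2) Run a calculation with autograd disabled: nothing is recorded for backward.
with torch.no_grad():
    c = a * 2

print(b.requires_grad, c.requires_grad)   # False False
``````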

Best regards

Thomas


Thank you all for the help!!


Hi, these days I went back to this problem for some reason, and it seems autograd did trace the `data` attribute. Here is an example based on yours:

``````
import torch

w = torch.ones(2, 1, requires_grad=True)  # parameter (initialization assumed so the numbers below match)
x = torch.ones(1, 2)
target = torch.full((1,), 10.)

# manipulate the data attribute before any computation
x.data = torch.cat((x, x), dim=1)   # (1, 2) -> (1, 4)
w.data = torch.cat((w, w), dim=0)   # (2, 1) -> (4, 1)

output = x.mm(w)
optimizer = torch.optim.SGD([w], lr=1.)

loss = (output - target)**2
loss.backward()
print(w.grad)
# > tensor([[-12.],
#           [-12.],
#           [-12.],
#           [-12.]])
``````

If autograd did not trace the `data` attribute, shouldn’t `w.grad` have a size of 2x1 instead of 4x1? Please correct me if I’m wrong. Thx!


Thanks for the example!
The manipulation of the `.data` attribute works in your example, as you are manipulating it before performing any computation.
I would still discourage its usage, as it might still break if you manipulate the data after some of the computation graph has already been created.
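
To make that failure mode concrete, here is a small sketch (toy tensors, unrelated to the code above) where touching `.data` after part of the graph exists silently corrupts the gradient, while the same in-place change without `.data` would be caught by autograd’s version check:

``````
import torch

a = torch.ones(2, requires_grad=True)
b = a * 3
c = b ** 2                # pow saves b for its backward

# b.fill_(10.) here would make backward() raise an in-place modification error.
b.data.fill_(10.)         # not tracked: the saved b is silently overwritten

c.sum().backward()
print(a.grad)             # tensor([60., 60.]) instead of the correct tensor([18., 18.])
``````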
