Hello, my work on customizing the linear multiplication y = wx in a neural network requires manipulating the shapes of the w and x matrices. For example, previously w and x had the shapes m×k and k×n respectively; now I need to temporarily expand them to m×2k and 2k×n to do some processing. Changing w and x directly doesn’t seem to be a good idea, since it would change the parameters being updated.

Therefore I created temporary tensors w1 and x1 with the expanded shapes and values, and compute y from these two, i.e.

y = w1x1, where w1 has the shape mx2k and x1 has the shape 2kxn

However, the neural network parameters do not get updated properly this way. My guess is that this customization broke the computational graph (i.e. y is no longer properly linked to w and x). Is there a way to set some property on y to fix this broken path?

I think to get a useful answer, you’d have to say a bit more about what you want the relationship between w1 and x1 to be (e.g. do you want w1 to have two blocks with the entries of w, or something completely different?).

Hi, thanks for your reply. One example of the relationship between w1, x1 and w, x could be: round each element of w and x to an integer first, then expand them to a bit representation. For instance, if x = [[1,0], [1,2]], then x1 would be [[0,0], [1,0], [0,1], [1,0]]. The result y would then be w1x1 plus some post-processing.

In this case, y is not directly computed from wx, and hence backpropagation would not flow properly. I wonder if there is a way to tell the network to treat y as if it were computed from wx?

One trick that often helps for “pretend it has been calculated with x even when I used x1” is to use y = wx + (w1x1 - wx).detach_(): in the forward pass wx cancels out, so y = w1x1, but in the backward pass the detach_ causes gradients to flow only into wx.
Would that work for you?
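Sketched out as runnable code (the repeat-based expansion below is just a placeholder for your real bit expansion, and the shapes are made up):

```python
import torch

torch.manual_seed(0)
m, k, n = 2, 3, 4
w = torch.randn(m, k, requires_grad=True)
x = torch.randn(k, n)

# Stand-in for the real expansion: tile w and x so that w1 @ x1 is
# well-defined (here w1 @ x1 simply equals 2 * (w @ x)).
w1 = w.repeat(1, 2)   # m x 2k
x1 = x.repeat(2, 1)   # 2k x n

wx = w.mm(x)
y = wx + (w1.mm(x1) - wx).detach()

# Forward value equals w1 @ x1, but gradients flow only through wx.
y.sum().backward()
print(torch.allclose(y, w1.mm(x1)))                       # True
print(torch.allclose(w.grad, x.sum(dim=1).expand(m, k)))  # True: grad of wx.sum()
```

The second check confirms that w.grad is exactly what plain y = wx would have produced, i.e. the detached correction term contributes nothing to the backward pass.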

Hi, thanks for your reply. I still don’t get what you mean by “not trace this operation”. I tried to create a simple network with only 2 parameters, w1 and w2. The middle layer is given by y1 = w1x, and the output layer by y2 = w2y1. Now if I simply do

As @InnovArul said, the manipulation of data won’t be traced and thus might lead to a wrong result.
Here is a simple example demonstrating this issue.
In the first part of the code, we just calculate the loss for our operations and apply the gradient on w.
We expect w to contain [[17.], [17.]] after the update.
In the second example, I’ve manipulated the underlying data of w after the gradient calculation.
Autograd did not trace this manipulation and the gradients for the original w are now applied on the manipulated w.

# Standard approach
import torch

x = torch.ones(1, 2)
w = torch.ones(2, 1, requires_grad=True)
target = torch.full((1,), 10.)
optimizer = torch.optim.SGD([w], lr=1.)
output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
> tensor([[-16.],
          [-16.]])
optimizer.step()
print(w)
> tensor([[ 17.],
          [ 17.]])
# Now manipulate the underlying data
w = torch.ones(2, 1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=1.)
output = x.mm(w)
loss = (output - target)**2
loss.backward()
print(w.grad)
> tensor([[-16.],
          [-16.]])
w.data = torch.full((2, 1), -100.)
optimizer.step()
print(w)
> tensor([[-84.],
          [-84.]])

Oh, I see. So in my example above, the backward computation will use the manipulated value of y1 (i.e. temp) instead of the original one (i.e. w1x), right?

But for this code below:

# method 1
y = w.mm(x)
temp = w1.mm(x1)
y.data = temp.data

# method 2
y = w.mm(x)
temp = w1.mm(x1)
y += (temp - y).detach_()

I still think they are equivalent, since during the backward computation the value of y will have been changed to temp in both cases. Please correct me if I’m wrong.
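A quick check on a toy example (shapes and values made up for illustration) seems to bear this out, at least for this case:

```python
import torch

x = torch.ones(1, 2)
x1 = x.repeat(1, 2)   # hypothetical expanded input, 1 x 4

def grad_of_w(method):
    w = torch.ones(2, 1, requires_grad=True)
    w1 = w.repeat(2, 1)             # hypothetical expanded weight, 4 x 1
    y = x.mm(w)
    temp = x1.mm(w1)
    if method == 1:
        y.data = temp.data          # discouraged: bypasses autograd
    else:
        y = y + (temp - y).detach() # explicit about where gradients flow
    (y ** 2).sum().backward()       # downstream op that uses y's value
    return w.grad

print(torch.equal(grad_of_w(1), grad_of_w(2)))  # True: both give [[8.], [8.]]
```

Both methods end up feeding temp’s value into the downstream computation while routing gradients through x.mm(w), so the resulting w.grad is identical here.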

Of course, there are many ways to make the operations equivalent for gradient calculation. It’s just not advised to access .data if you want to rely entirely on autograd for the correctness of your gradients.

From your question on the top, I feel that your use case would be achievable by using .repeat() function of tensor. Can you have a look?

Hi, thank you for the help! This could work, but instead of this one-line code, can I split it into separate lines?

w1 = w.repeat(1, 2)  # m x 2k
x1 = x.repeat(2, 1)  # 2k x n
y = w1.mm(x1)

Since w has .requires_grad set to True, the PyTorch documentation indicates w1 will also have .requires_grad set to True. Will this change the network topology or increase the number of parameters?
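A quick check (shapes chosen arbitrarily) suggests it does not: w1 is a non-leaf tensor inside the graph, so the optimizer still only ever sees w.

```python
import torch

w = torch.ones(2, 3, requires_grad=True)   # m x k parameter
x = torch.ones(3, 4)                       # k x n input

w1 = w.repeat(1, 2)   # m x 2k; requires_grad is inherited from w
x1 = x.repeat(2, 1)   # 2k x n

print(w1.requires_grad)  # True
print(w1.is_leaf)        # False: an intermediate result, not a new parameter

w1.mm(x1).sum().backward()
print(w.grad)            # all 8.s: each entry of w enters the product twice
```

Since w1 is recreated from w in every forward pass, gradients simply accumulate back into w and the parameter count is unchanged.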

I would generally recommend against using data at all.
There are two parts:

If you want to break the graph at a point, use .detach() instead.

If you want a calculation without autograd, use with torch.no_grad():.

Using .data is a bit like using those two combined, but except in very special situations (e.g. the optimizer updates internally) there isn’t a good reason to use it, unless you like headaches (it is much cheaper than too much beer at the Oktoberfest, though).

Thanks for the example!
The manipulation of the .data attribute works in your example, as you are manipulating it before performing any further computation.
I would still discourage using it, as it might break if you manipulate the data after part of the computation graph has already been created.