How are copied parameters updated?

W is a parameter tensor and A = W[[1, 1]] is produced by the indexing operator, so the two elements of A are copied from the same source. Are the gradients of A[0] and A[1] computed independently? If so, the computation is doubled.

Yes, the gradients for A[0] and A[1] will be computed independently, and the gradient for W[1] will be the sum of the two. This is the gradient of the function you implemented.
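For example, here is a minimal sketch of that behaviour (PyTorch, with arbitrarily chosen values):

import torch

W = torch.tensor([0.0, 2.0, 4.0], requires_grad=True)
A = W[[1, 1]]          # both elements of A are copies of W[1]
loss = (A * A).sum()   # dloss/dA = 2*A = [4., 4.]
loss.backward()
print(W.grad)          # tensor([0., 8., 0.]) -> the two contributions are summed into W[1]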

That sounds a bit unfortunate, but thanks for the reply.
For A = W[[1, 1]], could I make A[0], A[1], and W[1] share the same gradient and gradient function, or even the same storage?

I’m not sure what you mean by that. Could you write a code sample and describe what you expect to find in the .grad fields?

Here is pseudocode (roughly, in PyTorch terms):

import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(4))
index = [1, 1, 1, 2, 3]
for X, Y in data_loader:          # placeholder data loader
    output = X * W[index]
    loss = loss_fn(output, Y)     # placeholder loss function
    loss.backward()

In the above case, parameter W[1] has three copies whose gradients are computed independently, as you say. I was hoping the gradient of W[1] would be computed just once in loss.backward(), with the copies of W[1] sharing that gradient instead of computing their own; otherwise the backward pass gets slow.

The gradient of W[1] is computed once, but it will contain the sum of the gradients from each of the places where it is used.
Say W contains 3 values, O contains [W[1], W[1], W[1], W[2], W[3]], and your loss is sum(O).
Then when you call backward on this loss, the gradient of O wrt the loss is [1, 1, 1, 1, 1], and the gradient of W wrt the loss is [3, 1, 1].
So yes, the gradient of W[1] is computed once.
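A quick check of this example (using 0-based indexing in code, so the three values of W are W[0], W[1], W[2]):

import torch

W = torch.randn(3, requires_grad=True)
O = W[[0, 0, 0, 1, 2]]   # O = [W[0], W[0], W[0], W[1], W[2]]
O.retain_grad()          # keep O.grad so it can be inspected after backward
loss = O.sum()
loss.backward()
print(O.grad)            # tensor([1., 1., 1., 1., 1.])
print(W.grad)            # tensor([3., 1., 1.])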

Thanks for the clear explanation. I understand now that the gradient of W[1] is computed once.

For O, which contains [W[1], W[1], W[1], W[2], W[3]], the gradients of O[0], O[1], and O[2] are still computed independently even though they have the same value, so I think that computation is wasted. Could the gradients of O[0], O[1], O[2], and W[1] wrt the loss all be the same? I was hoping that only one of them would be computed and the others would share it.

They are the same in my example, but they could be different.
With the same O, if my loss is loss = 2*O[0] + 3*O[1] + 4*O[2] + 5*O[3] + 6*O[4], then the gradients of O wrt the loss would be [2, 3, 4, 5, 6] and the gradient of W would be [9, 5, 6].
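The same check with the weighted loss (again with 0-based indices in code):

import torch

W = torch.randn(3, requires_grad=True)
O = W[[0, 0, 0, 1, 2]]
O.retain_grad()
coeffs = torch.tensor([2.0, 3.0, 4.0, 5.0, 6.0])
loss = (coeffs * O).sum()   # 2*O[0] + 3*O[1] + 4*O[2] + 5*O[3] + 6*O[4]
loss.backward()
print(O.grad)               # tensor([2., 3., 4., 5., 6.])
print(W.grad)               # tensor([9., 5., 6.])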