Shared weights using permutation view semantics

I want to share the same set of weights across several nodes in a layer, somewhat like a convolutional neural net, but where the weight sharing does not reduce to a simple slicing operation but instead requires indirection through a permutation array.

A graphic view of what I have in mind: [image: Game of Y ANN]

$$ B_i = \phi(\sum_j A[p_{ij}]w[q_{ij}]) $$

where $p_{ij}$ and $q_{ij}$ are the node index and weight index respectively, as compared to the usual use of $j$ as both the node index and the weight index, as in $ B_i = \phi(\sum_j A_j w_{ij}) $.
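
In (rough) code, the computation I'm after looks something like the sketch below; the sizes, the random index tables, and tanh standing in for phi are just placeholders:

import torch

# Rough sketch of the computation (sizes, random index tables, and tanh as phi
# are only placeholders).
N, M, K = 6, 4, 3                        # N input nodes, M output nodes, K terms each

A = torch.randn(N)                       # activations of the previous layer
w = torch.randn(N)                       # pool of shared weights
p = torch.randint(0, N, (M, K))          # p[i, j]: node feeding term j of output i
q = torch.randint(0, N, (M, K))          # q[i, j]: shared weight scaling that term

B = torch.tanh((A[p] * w[q]).sum(dim=1)) # B_i = phi(sum_j A[p_ij] * w[q_ij])
print(B.shape)                           # torch.Size([4])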

Edit: To be clear, the key functionality is that I need a view rather than a copy from the permutation. That is, I want changes to the permuted vector to be reflected in the original and in all other permuted views.

Edit: I’m thinking this might simply not be supported. Implementing it would require a permutation filter at the kernel level, in the same way slicing operations are implemented.

Slicing and similar operations produce views, which are implemented by changing the stride associated with each dimension. Permutation would have to implement a level of indirection, and therefore involves a new flavor of Tensor with a corresponding kernel implementation. Instead of multiplying a coordinate by a stride, the index would first be looked up in the permutation vector.
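
To illustrate the distinction I mean, here is a small sketch: slicing gives a view that aliases the original storage, while indexing through an index tensor gives back a copy.

import torch

x = torch.arange(6.0)

s = x[1:4]                 # slicing: a view (same storage, different offset/stride)
s[0] = 100.0
print(x)                   # tensor([  0., 100.,   2.,   3.,   4.,   5.])

perm = torch.tensor([2, 0, 1])
v = x[perm]                # indexing through an index tensor: a copy
v[0] = -1.0
print(x[2])                # tensor(2.) -- the original is untouched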

Oops, guess we don’t have mathjax installed. Would be nice. Anyway, put another way, is it possible to implement a Tensor that has a built-in level of indirection through an index array?

Hi Ken!

PyTorch tensor indexing should let you use a tensor of your permuted indices
to index into your A and w tensors.

I don’t follow the details of your use case, but here is an example where the
permutation indices are packaged as 1d tensors:

>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2023)
>>>
>>> m = 5
>>> n = 7
>>>
>>> A = torch.randn (m, n)
>>> w = torch.randn (m, n)
>>>
>>> p = torch.randperm (5)
>>> q = torch.randperm (5)
>>>
>>> B = (A[:, p] * w[:, q]).sum (dim = 1)
>>>
>>> B
tensor([-4.3236,  0.2211, -0.5869,  0.6099, -1.7975])

Best.

K. Frank


Thanks, that correctly conveys the permutation operation. But unfortunately the permutation is a copy, not a view. This would work for feed-forward, but I don’t see how it would work for backprop. To do that, it seems like I would need the permutation to provide a permuted view rather than a copy. That way modifying a weight during backprop would accumulate through all of the copies.

Maybe I need to copy (through the permutation) on the feed-forward pass, then assign the average of the shared weight instances to the original weight on backprop. Does that seem reasonable? How would I do that?

This approach would be slightly faster on FF, but slower on BP, because BP would add an averaging step. I guess that’s ok.
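
Something like the following sketch is roughly what I have in mind for that manual averaging step (the sizes, index tables, and learning rate are all made up for illustration):

import torch

# A possible by-hand version of the "average the shared instances back" idea.
N, M, K = 6, 4, 3
w = torch.randn(N)                        # canonical shared weights (no autograd here)
p = torch.randint(0, N, (M, K))           # node indices
q = torch.randint(0, N, (M, K))           # weight indices

w_inst = w[q].clone().requires_grad_()    # per-instance copies used on feed-forward
A = torch.randn(N)

B = torch.tanh((A[p] * w_inst).sum(dim=1))
B.sum().backward()                        # w_inst.grad: one gradient per copy

# Average the per-copy gradients back onto each canonical weight.
grad_sum = torch.zeros(N).index_add_(0, q.flatten(), w_inst.grad.flatten())
counts = torch.zeros(N).index_add_(0, q.flatten(), torch.ones(q.numel()))
grad_avg = grad_sum / counts.clamp(min=1.0)

lr = 0.1
w -= lr * grad_avg                        # manual SGD step on the canonical weights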

Or maybe I do as @KFrank suggests on FF, and tweak the autograd system to redirect all the gradients for each weight to the canonical instance of that weight. But I’m not sure how to do that. Any thoughts?

Actually, I think you got me on the right track. The trouble is with the permutation of the weight vector. But I can reframe my problem so that the weights are not permuted; instead I permute the nodes. Then I can reuse the same weight vector everywhere. I make a layer that produces all the necessary node permutations, converting the N-element input vector into an NxM tensor, where M is the number of permutations. Now the next layer is just like a convolutional network, with an N-element kernel.
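
Roughly like this sketch (the sizes and the random permutations are only placeholders):

import torch

# Sketch of the reframed layer: permute the nodes, reuse one shared weight vector.
N, M = 6, 4                                   # N nodes, M permutations

perms = torch.stack([torch.randperm(N) for _ in range(M)])   # (M, N) index table
w = torch.randn(N, requires_grad=True)        # the single shared weight vector

A = torch.randn(N)                            # input activations
A_perm = A[perms]                             # (M, N): one permuted copy per row

B = torch.tanh((A_perm * w).sum(dim=1))       # same w applied to every row
B.sum().backward()                            # gradients accumulate into the one w
print(w.grad.shape)                           # torch.Size([6])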

Hi Ken!

Perhaps I misunderstand your use case, but autograd does manage
backpropagation through tensor indexing the way I would expect.

Consider:

>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> w = torch.ones (5, requires_grad = True)   # some weights to backpropagate to
>>>
>>> p = torch.tensor ([4, 1, 2, 3, 0])         # permutation swaps first and last
>>> c = 10 * (torch.arange (5.0) + 1)          # some factors to create gradient
>>>
>>> loss = (c * w[p]).sum()
>>> loss.backward()
>>>
>>> w.grad                                     # gradient of w is also permuted
tensor([50., 20., 30., 40., 10.])

If this isn’t the behavior you want, could you provide more clarification?

Best.

K. Frank

Oh, that is interesting! It even works if the index has repeats. Yeah, I didn’t understand the autograd system before this. Now I do. Thanks.

w = torch.ones (5, requires_grad = True)   # some weights to backpropagate to

p = torch.tensor ([4, 2, 2, 3, 0])         # index with a repeat (2 appears twice, 1 is skipped)
c = 10 * (torch.arange (5.0) + 1)          # some factors to create gradient

loss = (c * w[p]).sum()
loss.backward()
print (w.grad)     

tensor([50.,  0., 50., 40., 10.])