Tying weight of attention matrix

Rini · October 20, 2021, 10:09pm

Hello,

Is there a way to tie the weights of certain parts of a (square) attention matrix?
For example, if I divide the matrix into 4 blocks (2 diagonal and 2 off-diagonal) and want to tie the weights of the two off-diagonal blocks, how could I do that?

Thanks!

tumble-weed · October 21, 2021, 1:47pm

Say the full attention matrix is N x N and each block is block_N x block_N:

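import torch

N, block_N = 4, 2  # example sizes; assumes N == 2 * block_N
attention_W = torch.zeros(N, N)
# three independent leaf tensors: two diagonal blocks and one shared off-diagonal block
attention_W_diag1 = torch.randn(block_N, block_N).requires_grad_(True)
attention_W_diag2 = torch.randn(block_N, block_N).requires_grad_(True)
attention_W_off_diag = torch.randn(block_N, block_N).requires_grad_(True)
# in-place slice assignment is tracked by autograd, so gradients flow back to the blocks
attention_W[:block_N, :block_N] = attention_W_diag1
attention_W[block_N:, block_N:] = attention_W_diag2
# both off-diagonal blocks are filled from the same tensor, which ties their weights
attention_W[block_N:, :block_N] = attention_W_off_diag
attention_W[:block_N, block_N:] = attention_W_off_diag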
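
If you want to train this, one thing to watch: the assembled matrix is a snapshot of the block values at the time it was built, so it has to be rebuilt after every optimizer step. A minimal sketch of one way to handle that is below, assuming you are fine with wrapping the blocks in a small module (the class name `TiedBlockMatrix` and the argument `block_n` are made up for illustration, not an existing PyTorch API). It assembles the matrix with `torch.cat` on every call instead of slice assignment:

```python
import torch
import torch.nn as nn


class TiedBlockMatrix(nn.Module):
    """Hypothetical helper: a (2*block_n x 2*block_n) matrix whose two
    off-diagonal blocks share a single parameter."""

    def __init__(self, block_n: int):
        super().__init__()
        self.diag1 = nn.Parameter(torch.randn(block_n, block_n))
        self.diag2 = nn.Parameter(torch.randn(block_n, block_n))
        # one parameter, used for both off-diagonal blocks
        self.off_diag = nn.Parameter(torch.randn(block_n, block_n))

    def forward(self) -> torch.Tensor:
        # rebuild the full matrix from the current parameter values
        top = torch.cat([self.diag1, self.off_diag], dim=1)
        bottom = torch.cat([self.off_diag, self.diag2], dim=1)
        return torch.cat([top, bottom], dim=0)


tied = TiedBlockMatrix(block_n=3)
W = tied()          # shape (6, 6)
W.sum().backward()  # off_diag collects gradient from both positions
```

Because `off_diag` appears twice in the matrix, its gradient is the sum of the gradients at both off-diagonal positions, which is the usual behavior for tied weights. Note that, like the snippet above, this ties the two blocks as identical copies; if you want a symmetric matrix instead, you would use `self.off_diag.T` for one of them.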

Rini · November 26, 2021, 9:23pm

Thank you for explaining with the example! I think this is what I was looking for.