Hi, I finally got it done.
I can’t define the Conv1d layer as
conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim)
since the frame length N also varies across the training data. If you are interested, the following is what I did:
...
def __init__(...):
    ...
    # One Conv1d scores all heads at once: kernel size and stride both
    # equal head_dim, so each output position covers one chunk of h.
    self._attention = nn.Conv1d(1, self._head_num, self._head_dim,
                                stride=self._head_dim)
def forward(...):
...
att_list = []
for h in LastHiddenLayerOutput:
    # h is the hidden layer output with N frames of d dimensions
    # each, i.e. it has shape [N, d].
    # Reshape it to [N, 1, d] to match the Conv1d input layout
    # (batch, channels, length).
    score = self._attention(h.unsqueeze(0).permute(1, 0, 2))
    # Thanks to @spanev for the hint about torch.diagonal.
    # The conv output has shape [N, k, k]; the diagonal keeps, per
    # frame, head i's score from its own chunk i, giving [N, k].
    score = score.diagonal(dim1=1, dim2=2)
    # Normalize the attention scores over the N frames.
    score = torch.softmax(score, dim=0)
    # Tricky part: split h into k (head) chunks and make
    # h_a.shape = [d/k, k, N].
    # At this step, score.shape = [N, k], ready for a matmul with h_a.
    h_a = h.view(-1, self._head_num, self._head_dim)
    h_a = h_a.permute(2, 1, 0)
    # The matmul of h_a with score has shape [d/k, k, k]; only the
    # diagonal of the last two dims holds each head's attention-weighted
    # sum over all frames.
    score = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)
    # Flatten the [d/k, k] result back into a single d-dim vector.
    score = torch.flatten(score)
    att_list.append(score)
# Stack the per-utterance attended vectors into a [batch, d] tensor.
att_h = torch.stack(att_list)
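
In case anyone wants to try it, here is a minimal self-contained sketch of the same idea. The class name ConvMultiHeadAttention, the toy sizes d=8 and k=4, and the random inputs are my own assumptions for illustration, not part of the actual model:

import torch
import torch.nn as nn

class ConvMultiHeadAttention(nn.Module):
    # Sketch only: assumes d is divisible by k.
    def __init__(self, d=8, k=4):
        super().__init__()
        self._head_num = k
        self._head_dim = d // k
        self._attention = nn.Conv1d(1, self._head_num, self._head_dim,
                                    stride=self._head_dim)

    def forward(self, hidden_outputs):
        # hidden_outputs: list of [N_i, d] tensors; N_i may differ.
        att_list = []
        for h in hidden_outputs:
            score = self._attention(h.unsqueeze(0).permute(1, 0, 2))
            score = score.diagonal(dim1=1, dim2=2)   # [N, k]
            score = torch.softmax(score, dim=0)
            h_a = h.view(-1, self._head_num, self._head_dim)
            h_a = h_a.permute(2, 1, 0)               # [d/k, k, N]
            score = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)
            att_list.append(torch.flatten(score))    # [d]
        return torch.stack(att_list)                 # [batch, d]

m = ConvMultiHeadAttention(d=8, k=4)
out = m([torch.randn(5, 8), torch.randn(9, 8)])
print(out.shape)  # torch.Size([2, 8]): two utterances, 5 and 9 frames

Since the Conv1d only depends on the feature dimension d, utterances with different frame counts N go through the same module without padding.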