Separate the linear layer for different features

Hi, I finally got it done.

I can’t define the conv1d layer as

conv = nn.Conv1d(N, N * num_heads, head_dim, head_dim) 

since the frame lengths in the training data are variable as well. If you are interested, here is what I did:

...
    def __init__(...):
        ...
        self._attention = nn.Conv1d(1, self._head_num, self._head_dim, 
                              stride=self._head_dim)

    def forward(...):
        ...
        att_list = []
        for h in LastHiddenLayerOutput:
            # h is the hidden layer output with N frames and d dimensions of 
            # each, i.e. it has the shape [N, d]
            # transform it into shape of [N, 1, d] to suit the conv1d input
            score = self._attention(h.unsqueeze(0).permute(1, 0, 2))

            # Thanks to @spanev for the hint about Tensor.diagonal
            score = score.diagonal(dim1=1, dim2=2)
            score = torch.softmax(score, dim=0)

            # Tricky part: split h into k (head) parts and permute so that
            # h_a.shape = [d/k, k, N]
            # At this step score.shape = [N, k], so it can matmul with h_a
            h_a = h.view(-1, self._head_num, self._head_dim)
            h_a = h_a.permute(2, 1, 0)

            # The matmul of h_a with score has shape [d/k, k, k];
            # only the diagonal of the last 2 dims is the attention-weighted sum over all frames
            score = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)

            score = torch.flatten(score)

            att_list.append(score)
        
        att_h = torch.stack(att_list)
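For anyone who wants to check the shape flow, here is a minimal, self-contained sketch of the same steps outside the class. The toy sizes (`N=5`, `num_heads=4`, `head_dim=8`) are my own for illustration, not from the thread:

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes, chosen only for illustration
N, num_heads, head_dim = 5, 4, 8
d = num_heads * head_dim

attention = nn.Conv1d(1, num_heads, head_dim, stride=head_dim)

h = torch.randn(N, d)  # hidden-layer output: N frames, d dims each

# [N, d] -> [1, N, d] -> [N, 1, d] so each frame is a 1-channel signal of length d;
# the strided conv then scores each head's d/k slice with each head's kernel
score = attention(h.unsqueeze(0).permute(1, 0, 2))   # [N, num_heads, num_heads]

# Keep only the matching (head kernel, head slice) pairs on the diagonal
score = score.diagonal(dim1=1, dim2=2)               # [N, num_heads]
score = torch.softmax(score, dim=0)                  # normalize over frames

# Split h into heads and permute: [N, d] -> [N, k, d/k] -> [d/k, k, N]
h_a = h.view(-1, num_heads, head_dim).permute(2, 1, 0)

# [d/k, k, N] @ [N, k] broadcasts to [d/k, k, k];
# the diagonal keeps each head's weighted sum over frames
out = torch.matmul(h_a, score).diagonal(dim1=1, dim2=2)  # [d/k, k]
out = torch.flatten(out)                                 # [d]
print(out.shape)  # torch.Size([32])
```

Because the input to the conv always has length `d` (the frame count only appears in the batch dimension), this works with variable-length utterances.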