Perhaps if you could give a very high level description of your use case, some expert in that area might have suggestions. Are you working on some sort of (semi-) standard model? Is this a research direction that others might have experience with?

OK, that seems like a good idea, so I'll try.

I started studying this model for dependency parsing. Its results are good:

```
Model(
  (dropout): Dropout(p=0.6, inplace=False)
  (word_embedding): Embedding(25413, 100, padding_idx=0)
  (tag_embedding): Embedding(20, 40, padding_idx=0)
  (bilstm): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
  (bilstm_to_hidden1): Linear(in_features=1200, out_features=500, bias=True)
  (hidden1_to_hidden2): Linear(in_features=500, out_features=150, bias=True)
  (hidden2_to_pos): Linear(in_features=150, out_features=101, bias=True)
  (hidden2_to_dep): Linear(in_features=300, out_features=47, bias=True)
)
```

The forward pass through the bilstm is:

```
# INPUT: x = torch.Size([50, 61, 140]) and x_lengths = (50,)
x = torch.nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
x, _ = self.bilstm(x)
x, _ = torch.nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
x = x.contiguous()
x = x.view(-1, x.shape[2])
# OUTPUT: x = torch.Size([3050, 800]) --> (batch_size * seq_len, n_lstm_units)
# etc.
```

What I would like to do is add attention to the LSTM (bilstm), so I defined a new model:

```
NewModel(
  THE PREVIOUS PART IS THE SAME AS THE ORIGINAL MODEL
  (bilstm): MyLSTM(
    (dropout): Dropout(p=0.3, inplace=False)
    (lstm1): LSTM(908, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten1): Attention()
    (lstm2): LSTM(1200, 600, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
    (atten2): Attention()
  )
  THE NEXT PART IS THE SAME AS THE ORIGINAL MODEL
)
```

At this point, my problem is getting the output of the bilstm, after the attention is applied, to have the same dimensions as in the original model, so that the remaining part of the model still works.

The forward pass of the new bilstm (MyLSTM) is:

```
# INPUT: x = torch.Size([50, 61, 140]) and x_len = (50,)
x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
out1, (h_n, c_n) = self.lstm1(x)
x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)
# OUTPUT: x = torch.Size([50, 61, 800]) --> (batch_size, seq_len, n_lstm_units)
# OUTPUT: lengths = torch.Size([61])
x, att1 = self.atten1(x, lengths) # skip connect
# OUTPUT: x = torch.Size([50, 800]) and att1 = torch.Size([50, 61])
out2, (h_n, c_n) = self.lstm2(out1)
y, lengths = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)
y, att2 = self.atten2(y, lengths)
# OUTPUT: y = torch.Size([50, 800]) and att2 = torch.Size([50, 61])
z = torch.cat([x, y], dim=1)
return z # torch.Size([64, 1600])
```

So at the same point, after the `forward()` of the bilstm in the original model and after the `forward()` of the bilstm with attention, I get two different shapes: `torch.Size([3050, 800])` and `torch.Size([64, 1600])`, respectively. The difference comes from the fact that the first model keeps one row per token, `(batch_size * seq_len, n_lstm_units) = (50 * 61, 800)`, while in the second model the attention sums over the sequence dimension, so `x, att1` have the same dimensions as `y, att2`, that is (50, 800) and (50, 61), respectively.
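To make the mismatch concrete, here is a minimal sketch with random tensors standing in for the bilstm output. The names `h`, `att`, `pooled`, `flat` and `z` are only for illustration, and the sizes are the ones from the comments above:

```python
import torch

batch_size, seq_len, n_units = 50, 61, 800

# Random stand-ins for the padded bilstm output and the attention weights
h = torch.randn(batch_size, seq_len, n_units)
att = torch.softmax(torch.randn(batch_size, seq_len), dim=-1)

# Attention as in my module: weighted sum over the time axis,
# which leaves one vector per sentence
pooled = (h * att.unsqueeze(-1)).sum(dim=1)

# Original model: every token is kept, batch and time are flattened together
flat = h.contiguous().view(-1, n_units)

# Concatenating two pooled outputs doubles the features, not the rows
z = torch.cat([pooled, pooled], dim=1)

print(pooled.shape)  # torch.Size([50, 800])
print(flat.shape)    # torch.Size([3050, 800])
print(z.shape)       # torch.Size([50, 1600])
```

So the sequence dimension is gone after the weighted sum, and no reshape can bring back the one-row-per-token layout that the later layers expect.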

Does anyone have any idea what I might try?

Is there a “standard” way to implement a self-attention module for a bilstm? I’ll put the code I used below, in case it helps to understand the problem better.

```
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, hidden_size, batch_first=False):
        super(Attention, self).__init__()
        self.hidden_size = hidden_size
        self.batch_first = batch_first

        # a single learned scoring vector, initialised uniformly in [-stdv, stdv]
        self.att_weights = nn.Parameter(torch.Tensor(1, hidden_size), requires_grad=True)
        stdv = 1.0 / np.sqrt(self.hidden_size)
        nn.init.uniform_(self.att_weights, -stdv, stdv)

    def forward(self, inputs, lengths):
        # note: the bmm below expects inputs laid out as (batch, seq_len, hidden)
        if self.batch_first:
            batch_size, max_len = inputs.size()[:2]
        else:
            max_len, batch_size = inputs.size()[:2]

        # matrix mult: one score per time step -> (batch_size, max_len, 1)
        weights = torch.bmm(inputs,
                            self.att_weights  # (1, hidden_size)
                            .permute(1, 0)  # (hidden_size, 1)
                            .unsqueeze(0)  # (1, hidden_size, 1)
                            .repeat(batch_size, 1, 1)  # (batch_size, hidden_size, 1)
                            )

        # squeeze only the last dim, so a batch of size 1 keeps its batch dim
        attentions = torch.softmax(F.relu(weights.squeeze(-1)), dim=-1)

        # create mask based on the sentence lengths
        # (same device as the scores, no gradient needed)
        mask = torch.ones_like(attentions)
        for i, l in enumerate(lengths):
            if l < max_len:
                mask[i, l:] = 0

        # apply mask and renormalize attention scores (weights)
        masked = attentions * mask
        _sums = masked.sum(-1).unsqueeze(-1)  # sums per row
        attentions = masked.div(_sums)

        # apply attention weights
        weighted = torch.mul(inputs, attentions.unsqueeze(-1).expand_as(inputs))

        # get the final fixed vector representations of the sentences:
        # the sum over dim 1 removes the sequence dimension -> (batch_size, hidden_size)
        representations = weighted.sum(1)

        return representations, attentions


class MyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, batch_first, bidirectional, dropout):
        super(MyLSTM, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.lstm1 = nn.LSTM(input_size=input_size,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten1 = Attention(hidden_size * 2, batch_first=batch_first)  # 2 because bidirectional
        self.lstm2 = nn.LSTM(input_size=hidden_size * 2,
                             hidden_size=hidden_size,
                             num_layers=num_layers,
                             batch_first=batch_first,
                             bidirectional=bidirectional,
                             dropout=dropout)
        self.atten2 = Attention(hidden_size * 2, batch_first=batch_first)

    def forward(self, x, x_len):
        # x: (batch_size, seq_len, input_size), x_len sorted in decreasing order
        x = nn.utils.rnn.pack_padded_sequence(x, x_len, batch_first=True)
        out1, (h_n, c_n) = self.lstm1(x)
        x, lengths = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)
        x, att1 = self.atten1(x, lengths)  # skip connect

        # the second LSTM consumes the packed output of the first one
        out2, (h_n, c_n) = self.lstm2(out1)
        y, lengths = nn.utils.rnn.pad_packed_sequence(out2, batch_first=True)
        y, att2 = self.atten2(y, lengths)

        # (batch_size, 2 * hidden_size) each -> (batch_size, 4 * hidden_size)
        z = torch.cat([x, y], dim=1)
        return z
```
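One direction I could imagine is an attention layer that returns one vector per token instead of one per sentence, so the `(batch_size, seq_len, n_lstm_units)` shape survives and can be flattened as in the original model. The sketch below is only an illustration under that assumption: `TokenSelfAttention` is a hypothetical name, it uses scaled dot-product self-attention rather than the pooling attention above, and the sizes are toy values. I have not checked whether it helps parsing accuracy.

```python
import math

import torch
import torch.nn as nn


class TokenSelfAttention(nn.Module):
    """Hypothetical layer: self-attention that keeps the sequence dimension."""

    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, inputs, lengths):
        # inputs: (batch, seq_len, hidden), lengths: (batch,), every length >= 1
        max_len = inputs.size(1)
        positions = torch.arange(max_len, device=inputs.device)
        # True for real tokens, False for padding -> (batch, seq_len)
        valid = positions[None, :] < lengths.to(inputs.device)[:, None]

        q, k, v = self.query(inputs), self.key(inputs), self.value(inputs)
        scores = torch.bmm(q, k.transpose(1, 2)) / self.scale  # (batch, seq, seq)
        # padded keys get -inf, so softmax assigns them zero weight
        scores = scores.masked_fill(~valid[:, None, :], float("-inf"))
        attentions = torch.softmax(scores, dim=-1)

        context = torch.bmm(attentions, v)  # (batch, seq, hidden)
        # zero the rows that belong to padding tokens
        context = context * valid[:, :, None].to(context.dtype)
        return context, attentions


torch.manual_seed(0)
layer = TokenSelfAttention(hidden_size=8)
x = torch.randn(3, 5, 8)            # toy stand-in for the bilstm output
lengths = torch.tensor([5, 3, 2])

context, attentions = layer(x, lengths)
flat = context.contiguous().view(-1, context.shape[2])
print(context.shape)  # torch.Size([3, 5, 8])
print(flat.shape)     # torch.Size([15, 8])
```

With an output like `context`, the later `view(-1, x.shape[2])` step would still produce one row per token, which is what the linear layers after the bilstm expect.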