Hi,
While reading about the ASR project implementation in "Building an end-to-end Speech Recognition model in PyTorch", I came across a GRU implementation unlike any other RNN/GRU/LSTM implementation I have seen.
I am curious because this implementation has outperformed every other network I have tried in my experiments.
The implementation is as follows. First, the GRU layer:
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalGRU(nn.Module):
    def __init__(self, rnn_dim, hidden_size, dropout, batch_first):
        super(BidirectionalGRU, self).__init__()
        self.BiGRU = nn.GRU(
            input_size=rnn_dim, hidden_size=hidden_size,
            num_layers=1, batch_first=batch_first, bidirectional=True)
        self.layer_norm = nn.LayerNorm(rnn_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.layer_norm(x)   # normalize over the feature dimension before the recurrence
        x = F.gelu(x)
        x, _ = self.BiGRU(x)     # keep the output sequence, discard the final hidden state
        x = self.dropout(x)
        return x
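For reference, a single layer runs like this (toy dimensions of my own choosing, not from the article):

import torch

layer = BidirectionalGRU(rnn_dim=16, hidden_size=16, dropout=0.1, batch_first=True)
x = torch.randn(4, 50, 16)   # (batch, time, features), since batch_first=True
y = layer(x)
print(y.shape)               # torch.Size([4, 50, 32]), last dim is 2 * hidden_size (bidirectional)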
Then, in the main network class, the multilayer GRU is created as follows:
self.birnn_layers = nn.Sequential(*[
    BidirectionalGRU(rnn_dim=rnn_dim if i == 0 else rnn_dim * 2,  # later layers take the 2 * rnn_dim bidirectional output
                     hidden_size=rnn_dim,
                     dropout=dropout,
                     batch_first=i == 0)                          # True only for the first layer
    for i in range(n_rnn_layers)
])
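To check I was not missing something obvious, I rebuilt the same stack as a standalone module (toy sizes, values are mine) and ran a batch-first tensor through it; it goes through without a shape error:

import torch
import torch.nn as nn

# uses the BidirectionalGRU class defined above
rnn_dim, n_rnn_layers = 16, 3
birnn_layers = nn.Sequential(*[
    BidirectionalGRU(rnn_dim=rnn_dim if i == 0 else rnn_dim * 2,
                     hidden_size=rnn_dim,
                     dropout=0.1,
                     batch_first=i == 0)
    for i in range(n_rnn_layers)
])
x = torch.randn(4, 50, rnn_dim)   # (batch, time, features)
y = birnn_layers(x)
print(y.shape)                    # torch.Size([4, 50, 32]), no error despite the mixed conventions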
What I don't understand is why the first layer has batch_first=True while all subsequent layers use batch_first=False. As far as I can tell, batch_first only changes the tensor layout that nn.GRU expects, so after the first layer the remaining layers would seem to read the batch axis as the time axis.
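For clarity, this is my understanding of what batch_first controls (a minimal standalone sketch, toy sizes are mine):

import torch
import torch.nn as nn

gru_bf = nn.GRU(input_size=8, hidden_size=4, batch_first=True)
gru_tf = nn.GRU(input_size=8, hidden_size=4, batch_first=False)

out_bf, _ = gru_bf(torch.randn(2, 5, 8))   # read as (batch=2, time=5, features=8)
out_tf, _ = gru_tf(torch.randn(5, 2, 8))   # read as (time=5, batch=2, features=8)
print(out_bf.shape, out_tf.shape)          # torch.Size([2, 5, 4]) torch.Size([5, 2, 4])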
If anyone is familiar with why this is being done, I would really appreciate any help.