While reading about the ASR project implementation in "Building an end-to-end Speech Recognition model in PyTorch", I came across a GRU implementation that is unlike any other RNN/GRU/LSTM implementation I have seen.
The reason I am curious is that this implementation has outperformed every other network I have tried in my experiments.
The implementation is as follows. This is the GRU:
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalGRU(nn.Module):
    def __init__(self, rnn_dim, hidden_size, dropout, batch_first):
        super(BidirectionalGRU, self).__init__()
        self.BiGRU = nn.GRU(
            input_size=rnn_dim, hidden_size=hidden_size,
            num_layers=1, batch_first=batch_first, bidirectional=True)
        self.layer_norm = nn.LayerNorm(rnn_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.layer_norm(x)  # normalize over the feature dimension
        x = F.gelu(x)           # activation applied before the recurrent layer
        x, _ = self.BiGRU(x)    # bidirectional -> feature dim becomes hidden_size * 2
        x = self.dropout(x)
        return x
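To check my understanding, I ran a quick shape test of my own (this is not from the article; the sizes are hypothetical): with bidirectional=True, the output feature dimension is hidden_size * 2, which is presumably why the later layers take rnn_dim * 2 as input.

import torch

# Hypothetical sizes, for illustration only.
rnn_dim, batch, time = 512, 4, 100

layer = BidirectionalGRU(rnn_dim=rnn_dim, hidden_size=rnn_dim,
                         dropout=0.1, batch_first=True)

x = torch.randn(batch, time, rnn_dim)  # (batch, time, feature)
out = layer(x)
print(out.shape)  # torch.Size([4, 100, 1024]) -- feature dim doubled to hidden_size * 2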
Then, in the main network class, the multilayer GRU is created as follows:
self.birnn_layers = nn.Sequential(*[
    BidirectionalGRU(
        rnn_dim=rnn_dim if i == 0 else rnn_dim * 2,  # later layers take the doubled bidirectional output
        hidden_size=rnn_dim,
        dropout=dropout,
        batch_first=i == 0)                          # True only for the first layer
    for i in range(n_rnn_layers)
])
What I don't understand is why the first layer uses batch_first=True while all subsequent layers use batch_first=False.
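To make the question concrete, here is my understanding of what batch_first controls, from a small test I wrote myself (not from the article): with batch_first=True the GRU reads its input as (batch, time, feature), and with batch_first=False as (time, batch, feature).

import torch
import torch.nn as nn

gru_bf = nn.GRU(input_size=8, hidden_size=16, batch_first=True)   # expects (batch, time, feature)
gru_sf = nn.GRU(input_size=8, hidden_size=16, batch_first=False)  # expects (time, batch, feature)

x = torch.randn(4, 10, 8)

out_bf, _ = gru_bf(x)  # x read as (batch=4, time=10, feature=8)
out_sf, _ = gru_sf(x)  # the same tensor read as (time=4, batch=10, feature=8)

print(out_bf.shape)  # torch.Size([4, 10, 16])
print(out_sf.shape)  # torch.Size([4, 10, 16]) -- same shape, so no error is raised either way

Since the shapes line up either way, mixing the two settings never raises an error, which is exactly why I can't tell from running the code whether this is intentional.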
If anyone is familiar with why this is being done, I would really appreciate any help.