Struggling to convert a Tensorflow GRU to Pytorch

Hello everyone,

I’ve been trying to replicate the results from a Tensorflow RNN model (on Kaggle) by using Pytorch + fastai without success.
I think this is more of a Pytorch question than fastai; if not, I’ll take it to the fastai forum.

Here is the original TF kernel:

And my Pytorch version:

I’ve tried to keep as close to the original version as possible. Differences are:

  • The “generator” function has been replaced by a Dataset
  • My Pytorch model is defined in class EQRNN
  • I use fastai to run the loop.

With the Pytorch version, my training loss is stuck around 3.0 (test loss around 2.6), while the TF model does much better (training loss around 2.0, test loss around 1.6).

This is the TF model:

model = Sequential()
model.add(CuDNNGRU(48, input_shape=(None, n_features)))
model.add(Dense(10, activation='relu'))

And my Pytorch version:

import torch.nn as nn

class EQRNN(nn.Module):
    def __init__(self, input_size=12, hidden_size=48, num_layers=1, bidirectional=False, dropout=0.5):
        super().__init__()  # must run before any submodule is assigned
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional, self.num_layers = bidirectional, num_layers
        self.num_directions = 2 if bidirectional else 1
        # note: the dropout argument is accepted but not used anywhere yet
        self.rnn = nn.GRU(input_size, hidden_size, num_layers=num_layers,
                          bidirectional=self.bidirectional, batch_first=True)
        self.final_layers = nn.Sequential(
            nn.Linear(self.num_directions * hidden_size, 10),
        )

    def forward(self, input_seq):
        # output has shape (batch_size, seq_len, num_directions * hidden_size)
        # h_n (the final hidden state) is not needed
        output, h_n = self.rnn(input_seq)
        output = output[:, -1, :]  # keep only the last time step
        output = self.final_layers(output)
        return output
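
One difference between the two definitions above that may be worth ruling out is the default weight initialization. As far as I know, Keras GRU layers default to a Glorot-uniform input kernel, an orthogonal recurrent kernel and zero biases, while nn.GRU draws every parameter from U(-1/sqrt(hidden_size), 1/sqrt(hidden_size)). Below is a minimal sketch that re-initializes an nn.GRU in the Keras style. The helper name keras_style_gru_init is made up for illustration, and I have not confirmed that initialization explains the gap in loss.

```python
import torch
import torch.nn as nn


def keras_style_gru_init(gru: nn.GRU) -> None:
    """Re-initialize an nn.GRU in place, mimicking Keras' default GRU initializers.

    Hypothetical helper (not part of PyTorch or fastai).
    """
    for name, param in gru.named_parameters():
        if name.startswith("weight_ih"):
            # Keras default kernel_initializer: glorot_uniform
            nn.init.xavier_uniform_(param)
        elif name.startswith("weight_hh"):
            # Keras default recurrent_initializer: orthogonal
            nn.init.orthogonal_(param)
        elif name.startswith("bias"):
            # Keras default bias_initializer: zeros
            nn.init.zeros_(param)


torch.manual_seed(0)
gru = nn.GRU(input_size=12, hidden_size=48, batch_first=True)
keras_style_gru_init(gru)
```

Inside EQRNN this would be called once at the end of __init__, as keras_style_gru_init(self.rnn).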

Am I missing something?

Let me know if I have given all the information needed; I feel like I am asking a lot of you guys to review my code… Thank you so much!

I dug into it more and it seems that, for the Pytorch version, the outputs of the RNN almost all converge to 1/-1. Seems like we are saturating the tanh activation…
It’s strange that this is occurring with Pytorch and not Tensorflow. I was also able to overfit a sample train batch on TF but not on Pytorch.
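
To put a number on the saturation, here is a minimal sketch that measures the fraction of GRU outputs sitting close to +1 or -1. It runs on synthetic random data, not the Kaggle data, and the helper saturation_fraction and the 0.99 threshold are arbitrary choices for illustration. The comparison of unit-scale and large-magnitude inputs reflects one possible cause (unscaled features), which is an assumption and not something I have confirmed.

```python
import torch
import torch.nn as nn


def saturation_fraction(outputs: torch.Tensor, threshold: float = 0.99) -> float:
    """Return the fraction of elements whose absolute value exceeds `threshold`."""
    return (outputs.abs() > threshold).float().mean().item()


torch.manual_seed(0)
gru = nn.GRU(input_size=12, hidden_size=48, batch_first=True)

# Synthetic stand-in data: (batch, seq_len, n_features)
x_small = torch.randn(8, 150, 12)   # roughly unit-scale features
x_large = x_small * 1000.0          # same data with a much larger magnitude

with torch.no_grad():
    out_small, _ = gru(x_small)
    out_large, _ = gru(x_large)

frac_small = saturation_fraction(out_small)
frac_large = saturation_fraction(out_large)
print(f"saturated fraction, unit-scale inputs: {frac_small:.3f}")
print(f"saturated fraction, large inputs:      {frac_large:.3f}")
```

Running the same check on the output of self.rnn for a real training batch would show whether the saturation is already there at initialization or only appears during training.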

Someone had a similar problem here

If you have any suggestions, please let me know! Thanks