I’ve written previously about how I was having trouble converting a TF/Keras model to PyTorch and getting the same results.
I didn’t get a response, so in order to simplify things I built this notebook:
I have a dataset of 10 randomly generated inputs, each of shape (150 timesteps × 12 features), and their corresponding float outputs (this is a regression task).
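To make the setup concrete, here is a sketch of the data; the shapes follow the description above, but the random generation and target range are my assumptions:

```python
import torch

# 10 samples, each 150 timesteps of 12 features, with one float target each.
# The seed and the target range (roughly matching the targets shown below)
# are assumptions, not from the original notebook.
torch.manual_seed(0)
X = torch.randn(10, 150, 12)  # (batch, timesteps, features)
y = torch.rand(10) * 13       # positive float regression targets
```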
The RNN model is defined as below:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size=12, hidden_size=48, num_layers=1, bidirectional=False):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional, self.num_layers = bidirectional, num_layers
        if bidirectional:
            self.num_directions = 2
        else:
            self.num_directions = 1
        self.rnn = nn.GRU(input_size, hidden_size, bidirectional=self.bidirectional,
                          batch_first=True, num_layers=num_layers)
        self.final_layers = nn.Sequential(
            nn.Linear(self.num_directions * hidden_size, 10),
            nn.ReLU(),
            nn.Linear(10, 1),
        )

    def forward(self, input_seq):
        output, h_n = self.rnn(input_seq)
        output = output[:, -1, :]  # keep only the last timestep
        output = self.final_layers(output)
        return output
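For reference, my overfitting attempt looks roughly like the sketch below. The optimizer, learning rate, and step count are assumptions, and the model is a compact stand-in with the same GRU-plus-MLP-head structure as the class above, repeated here only so the snippet runs on its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyRNN(nn.Module):
    """Compact stand-in: unidirectional GRU followed by the same small MLP head."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(12, 48, batch_first=True)
        self.head = nn.Sequential(nn.Linear(48, 10), nn.ReLU(), nn.Linear(10, 1))

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])  # last timestep only

X = torch.randn(10, 150, 12)
y = torch.rand(10, 1) * 13

model = TinyRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

losses = []
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```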
When I try to overfit my 10 data points, the training loss eventually gets stuck.
The model's outputs are all very close to each other:
When our targets are the following:
tensor([ 4.3582, 9.1221, 0.4407, 0.3569, 2.3914, 5.2743, 5.6834, 12.2206,
I’ve seen signs that some of the hidden activations are saturating (at +1/−1), which probably explains why all the outputs look the same.
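The saturation can be measured directly: the GRU hidden state is tanh-bounded, so values piling up near ±1 indicate saturation. A quick sketch of the check (the 0.99 threshold is my own choice, and in practice you would run this on the trained model rather than the freshly initialized one used here):

```python
import torch
import torch.nn as nn

# Measure what fraction of GRU hidden activations sit near +/-1.
# Shapes follow the post's setup; weights here are untrained, so this
# only demonstrates the check itself.
torch.manual_seed(0)
gru = nn.GRU(12, 48, batch_first=True)
x = torch.randn(10, 150, 12)
out, h_n = gru(x)
saturated = (out.abs() > 0.99).float().mean().item()
print(f"fraction of saturated GRU activations: {saturated:.3f}")
```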
I tried normalizing the data and clipping gradients, with no success.
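Concretely, the two mitigations I tried look like this; the normalization axes and the `max_norm` value are my assumptions, not necessarily what matters for the bug:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(10, 150, 12)

# Per-feature normalization over the batch and time dimensions.
X_norm = (X - X.mean(dim=(0, 1))) / (X.std(dim=(0, 1)) + 1e-8)

# Gradient-norm clipping inside a training step (placed between
# loss.backward() and optimizer.step()); the loss here is a dummy.
model = nn.GRU(12, 48, batch_first=True)
out, _ = model(X_norm)
loss = out.pow(2).mean()
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```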
As mentioned in my original post, the same model in TensorFlow does not have this issue.
Thanks for reading, and if you have any suggestions or need more information, let me know!