Loss not decreasing in LSTM network

Hi.

I am trying to write an RNN model that consists of a simple one-layer LSTM whose final hidden state is fed through a linear+ReLU layer and then into a linear output layer (it is a regression problem). The task is univariate time-series forecasting: the LSTM processes the sequence and sends its final hidden state to the dense head, which then forecasts the future values of the same time series. I am writing it like this:

import torch
import torch.nn as nn
from torch.nn.init import xavier_uniform_, orthogonal_, zeros_


class Seq2DenseModel(nn.Module):
    def __init__(self):
        super(Seq2DenseModel, self).__init__()
        # Placeholders for intermediate outputs; overwritten on every forward pass
        self._lstm_output = torch.zeros((64, 256, 256))
        self._lstm_output_flattened = self._lstm_output[:, -1, :]
        self._denselayers = []

        # Constructing the LSTM layer (input size 1, hidden size 256)
        self.lstm = nn.LSTM(1, 256, batch_first=True)
        for attrib in dir(self.lstm):
            if attrib.startswith("weight_ih"):
                xavier_uniform_(getattr(self.lstm, attrib))
            elif attrib.startswith("weight_hh"):
                orthogonal_(getattr(self.lstm, attrib))
            elif attrib.startswith("bias_"):
                zeros_(getattr(self.lstm, attrib))

        # Constructing the dense layers: 256 -> 256 -> ReLU -> 128
        l = nn.Linear(256, 256)
        xavier_uniform_(l.weight)
        zeros_(l.bias)
        self._denselayers.append(l)
        self._denselayers.append(nn.ReLU())
        l = nn.Linear(256, 128)
        xavier_uniform_(l.weight)
        zeros_(l.bias)
        self._denselayers.append(l)
        self.decoder = nn.Sequential(*self._denselayers)

    def forward(self, x):
        # x: (batch, seq_len, 1)
        self.lstm.flatten_parameters()
        self._lstm_output, _ = self.lstm(x)
        # Take the output at the last time step, which for a single-layer,
        # unidirectional LSTM equals the final hidden state
        self._lstm_output_flattened = self._lstm_output[:, -1, :]
        out = self.decoder(self._lstm_output_flattened)
        return out
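
A quick shape check with dummy inputs (a batch of 64 sequences of length 100; the numbers here are purely illustrative) produces the shapes I expect, so the forward pass itself runs fine:

model = Seq2DenseModel()
x = torch.randn(64, 100, 1)   # (batch, seq_len, n_features=1)
print(model(x).shape)         # torch.Size([64, 128]), i.e. 128 forecast steps per sample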

Is this the correct way to write this kind of RNN? When I use this network, my loss (MSE) does not decrease during training. I am training it the way you would train any PyTorch network. Please ignore the initialization settings in the model above; I added them because I wanted to recreate exactly the same network in Keras and compare results. The Keras model, which is identical in every way (initialization, hyperparameters, optimization parameters, etc.), produces good results and trains normally, so it is not a dataset problem or anything like that. As for the training loop, I am using the same loop I have used before for simple fully connected networks in PyTorch with no problems, so it is unlikely that the training loop is the issue either.
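
For reference, the training loop follows the standard pattern, roughly like this sketch (simplified; train_loader, the optimizer choice, and the learning rate are placeholders rather than the exact values from the notebook):

model = Seq2DenseModel()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer / learning rate

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:   # x: (batch, seq_len, 1), y: (batch, 128)
        optimizer.zero_grad()
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        optimizer.step()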

So if this is the correct way of implementing a simple LSTM connected to a hidden linear+ReLU layer followed by an output layer, why is my model not training?

The full implementation can be found in this notebook, which downloads the data from the net, so it has no dependencies and you can run it yourself. The notebook has lots of bells and whistles; ignore them and just look at the training-set loss during training for the Keras and the PyTorch networks (which are identical).
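
For completeness, the Keras counterpart is roughly the following sketch (just the architecture as described above; the exact initializers and optimizer settings are in the notebook):

from tensorflow import keras
from tensorflow.keras import layers

keras_model = keras.Sequential([
    layers.LSTM(256, input_shape=(None, 1)),   # returns only the final hidden state
    layers.Dense(256, activation="relu"),
    layers.Dense(128),
])
keras_model.compile(loss="mse", optimizer="adam")   # placeholder optimizer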