Problem of Vanishing Gradients in a GRU in a Seq2Seq Model

Hi,
I am training an encoder-decoder model where the encoder is just a GRU and the decoder is a simple MLP that takes the GRU's hidden state as its input. My data is stock market prices (Open, High, Low, Close). The inputs to the encoder are sequences of OHLC prices (for example 20 OHLC rows, where 20 is the window size), and the sequence in each window is independent of the other windows. During training, after some iterations, all of the encoder's gradients become zero and the hidden states all become identical. I initialize the hidden state to zero before feeding each input to the encoder. Here is my encoder model:

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, device):
        super(EncoderRNN, self).__init__()
        self.device = device
        self.hidden_size = hidden_size
        # nn.GRU defaults to batch_first=False, so it expects
        # input of shape (seq_len, batch, input_size)
        self.gru = nn.GRU(input_size, hidden_size)

    def forward(self, x):
        # add a batch dimension if a single sequence is passed
        if len(x.shape) < 3:
            x = x.unsqueeze(1)

        # fresh zero hidden state for every window
        hidden = self.initHidden(x.shape[1])

        output, hidden = self.gru(x, hidden)
        return output, hidden

    def initHidden(self, batch_size):
        # shape: (num_layers * num_directions, batch, hidden_size)
        return torch.zeros(1, batch_size, self.hidden_size, device=self.device)

I feed the data to the model in batches, so x.shape[1] is the batch size.
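For clarity, here is a minimal sketch of how I call the encoder (the shapes and variable names below are just illustrative, not my exact training code):

import torch
import torch.nn as nn

# illustrative sizes: windows of 20 OHLC rows, batches of 32 windows, 4 features
window_size, batch_size, n_features, hidden_size = 20, 32, 4, 64

device = torch.device("cpu")
encoder = EncoderRNN(n_features, hidden_size, device)

# nn.GRU with batch_first=False expects input of shape (seq_len, batch, input_size)
x = torch.randn(window_size, batch_size, n_features)

output, hidden = encoder(x)
print(output.shape)  # torch.Size([20, 32, 64])
print(hidden.shape)  # torch.Size([1, 32, 64])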
I would appreciate it if someone could help me :slight_smile:

Hello, if it’s a vanishing gradient problem, this can be addressed with gradient clipping.
You can do this by registering a simple backward hook on each parameter.

clip_value = 0.5
for p in model.parameters():
    # clamp each gradient to [-clip_value, clip_value] as it is computed
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
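If you would rather clip after the backward pass instead of inside it, PyTorch also provides built-in utilities for this; a minimal sketch (clip_value and max_norm are just example values):

import torch

# call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
# or clip the total gradient norm instead of individual values:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)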

Thank you so much for your answer.
Actually, I already clip the gradients in each iteration. Here is the code:

        for param in model.parameters():
            # clamp gradients in place after loss.backward()
            if param.grad is not None:
                param.grad.data.clamp_(-1, 1)

Can the problem be related to the input data?

Hey, could you try initializing the weights of the model?

Sorry @Usama_Hasan, but how would that help?

Hi @Mehran_tgn, it’s better to initialize the weights explicitly; it usually helps convergence.
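For example, here is a minimal sketch of explicit initialization for the GRU and linear layers (Xavier for weight matrices, zeros for biases; this scheme is just one reasonable choice, and model stands in for your own module):

import torch.nn as nn

def init_weights(module):
    # Xavier-initialize weight matrices, zero out the biases
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.GRU):
        for name, param in module.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)
            elif "bias" in name:
                nn.init.zeros_(param)

model.apply(init_weights)  # applies init_weights to every submodule recursively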

Hi @Usama_Hasan
I have initialized my weights from a normal distribution, but that didn’t work either. My model diverges after some iterations. I think it might be because of the loss function; I used Smooth L1 Loss. I should try a different loss to check whether the problem is with the loss function or not. This problem has become quite confusing to me.

I have solved the problem :))
It was because I hadn’t normalized my input data. In this problem I have multiple models, including MLP, convolutional and LSTM/GRU ones. The MLP and convolutional models performed well without normalization, which is why I hadn’t normalized the input data before.


Hi @Mehran_tgn, it’s good to hear that you have solved your problem; it’s sometimes really hard to find a fix without all the details. My advice: write up a post with all the possible things that can cause such behaviour. :slight_smile:

Hi @Usama_Hasan
Thank you so much
I have tried some other things that may be related to this issue. I’ll summarize everything I have done here:

  1. Sometimes your data consists of independent sequences: a window moves over the data, the data inside each window is a sequence, and the sequences in different windows are independent of each other. In that case you should detach_() the hidden state of the LSTM/GRU between windows so that the history is cut; otherwise backpropagation through time would run all the way back to the beginning of the input history (see the sketch after this list).
  2. If your data contains values on very different scales, such as stock market prices, normalize it first. You can do this with, for example, MinMaxScaler() from the sklearn library.
  3. Don’t forget to apply gradient clipping when training on sequential data.
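Here is a minimal sketch of points 1 and 2, using synthetic data in place of my real OHLC values (the shapes and names are only illustrative):

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

# point 2: normalize the OHLC data (fit the scaler on the training split only)
train_ohlc = np.random.rand(1000, 4) * 100        # stand-in for real OHLC rows
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_ohlc)   # use scaler.transform() on val/test data

# point 1: detach the hidden state between windows if you carry it over
window_size, n_features, hidden_size = 20, 4, 64
gru = nn.GRU(n_features, hidden_size)

hidden = None
for start in range(0, len(train_scaled) - window_size, window_size):
    window = train_scaled[start:start + window_size]
    x = torch.tensor(window, dtype=torch.float32).unsqueeze(1)  # (seq_len, batch=1, 4)
    if hidden is not None:
        hidden = hidden.detach()  # cut the graph so BPTT stops at the window boundary
    output, hidden = gru(x, hidden)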

Hey, thank you for your detailed answer. I also have the vanishing gradient problem, with a network of 2 CNN layers followed by two stacked GRUs. My data comes from sensors and is fed to the network in windows; each window has 4 channels and a length of 1450 (sensor readings).
How did you normalise your data?
Did you normalise each window using the values of only that window, or relative to all samples in the data?