Hi,
I am training an encoder-decoder model where the encoder consists of only a GRU, and the decoder is a simple MLP that takes the GRU's hidden state as input. My data is stock market prices (Open, High, Low, Close). The inputs to the encoder are sequences of OHLC prices (for example, 20 OHLC rows, where 20 is the window size). The sequence in each window is independent of the other windows. During training, after some iterations, all the gradients of the encoder become zero and the hidden states all become the same. I initialize the hidden state to zero before feeding each input to the encoder. Here is my encoder model:

Hello, if it's a vanishing gradient problem, gradient clipping may help.
You can do this by registering a simple backward hook.

clip_value = 0.5
# clamp every gradient element to [-clip_value, clip_value] during backward
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

Hi @Usama_Hasan
I have initialized my weights using a normal distribution, but that didn't work either. My model diverges after some iterations. I think it's because of the loss function. I used Smooth L1 Loss. I should test whether the problem is with the loss function or not. This problem has become very confusing to me.

I have solved the problem :))
It was because I hadn't normalized my input data. In this problem, I have multiple models, including MLP, convolutional, and LSTM/GRU models. The MLP and convolutional models performed well without normalization. That's why I hadn't normalized my input data before.

Hi @Mehran_tgn, it's good to hear that you have solved your problem. It's sometimes really hard to solve a problem or find a fix without all the details. My advice: try to make a post listing all the possible things that can cause such behaviour.

Hi @Usama_Hasan
Thank you so much
I have tried some other stuff which can be related to this one. I summarize all the stuff I have done here:

Sometimes your data consists of independent sequences, meaning that you have a window that moves over your data. The data inside the window is a sequence, and the sequences in different windows are independent. In this case, you should call detach_() on the hidden state of the LSTM or GRU so that the history is cut. Otherwise, in backpropagation through time, the backprop would go all the way back to the beginning of the input data history.
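A minimal sketch of what cutting the history can look like, assuming PyTorch and a made-up setup (random data, a small GRU, and a linear head standing in for the MLP decoder; none of these sizes come from the original model). The hidden state is detached before each window, so backward() only goes through the current window:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 4 features (OHLC), window of 20 steps, batch of 2.
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)  # stand-in for the MLP decoder
params = list(gru.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

windows = torch.randn(3, 2, 20, 4)  # 3 training steps of random data
targets = torch.randn(3, 2, 1)

hidden = torch.zeros(1, 2, 8)
for x, y in zip(windows, targets):
    # Cut the graph so gradients do not flow into earlier windows.
    hidden = hidden.detach()
    _, hidden = gru(x, hidden)
    loss = nn.functional.smooth_l1_loss(head(hidden[-1]), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

If the windows are fully independent, passing a fresh zero hidden state for every window has the same effect of not backpropagating into earlier windows.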

If your data contains values on very different scales, such as stock market prices, try to normalize your data first. You can do it with, for example, MinMaxScaler() from the sklearn library.
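A small sketch of the normalization step, assuming scikit-learn and made-up OHLC values (the array, the 80/20 split, and the variable names are only for illustration). Here the scaler is fitted on the training rows only and then reused on the rest, which is one common choice and not necessarily what was done in this thread:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Made-up price table: 100 rows, 4 columns (Open, High, Low, Close).
prices = rng.uniform(100.0, 200.0, size=(100, 4))
train, test = prices[:80], prices[80:]

scaler = MinMaxScaler()                    # scales each column to [0, 1]
train_scaled = scaler.fit_transform(train)  # learn min/max from training rows
test_scaled = scaler.transform(test)        # reuse the same min/max
```

Test values that fall outside the training range will land slightly outside [0, 1] with this approach.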

Don't forget to do gradient clipping in your model when using sequential data.
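One way to do the clipping, sketched with torch.nn.utils.clip_grad_norm_, which rescales all gradients so their combined norm does not exceed a limit. The model, the data, and max_norm = 1.0 are assumptions for illustration only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical GRU and random input, just to produce some gradients.
model = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 20, 4)
out, _ = model(x)
loss = out.pow(2).sum()
loss.backward()

max_norm = 1.0  # assumed limit, tune for your problem
# Returns the total gradient norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Recompute the total norm to confirm it is now within the limit.
norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
```

In a training loop this call goes between loss.backward() and optimizer.step().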

Hey, thank you for your detailed answer. I also have the vanishing gradient problem, with a network of 2 CNN layers followed by two stacked GRUs. My data comes from sensors and is fed to the network in windows; each window has 4 channels (sensors) and a length of 1450.
How did you normalise your data?
Did you normalise each window using only the values of that window, or relative to all samples in the data?