Yes, indeed.
Yes again.
After reading your link, I am wondering how smth's answer from here (hidden states for all timesteps) fits in. I already incorporated this suggestion and it seems to work well. Thanks.
I see your point about adding a linear layer so that the gradient updates are distributed better. I will certainly try this tomorrow. Do you think its absence might be the cause of the vanishing gradients?
One more clue I'd like to add: decreasing the learning rate makes the gradients vanish much more slowly.
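
For reference, here is a minimal sketch of the setup discussed above: an LSTM whose hidden states for all timesteps are passed through a linear layer, plus a small helper that prints per-parameter gradient norms so vanishing gradients are easier to spot. The class name `LSTMWithHead`, the helper `grad_norms`, the layer sizes, the random data, and the MSE loss are all assumptions made for illustration, not the actual model from this thread.

```python
import torch
import torch.nn as nn


class LSTMWithHead(nn.Module):
    """Hypothetical model: LSTM followed by a linear layer on every timestep."""

    def __init__(self, input_size=8, hidden_size=16, output_size=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # out holds the hidden state for every timestep:
        # shape (batch, seq_len, hidden_size)
        out, _ = self.lstm(x)
        # nn.Linear acts on the last dimension, so it is applied per timestep:
        # shape (batch, seq_len, output_size)
        return self.head(out)


def grad_norms(model):
    """Return the L2 norm of the gradient of each parameter that has one."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }


torch.manual_seed(0)
model = LSTMWithHead()
x = torch.randn(4, 10, 8)        # assumed (batch, seq_len, features)
target = torch.randn(4, 10, 1)   # assumed per-timestep targets

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

norms = grad_norms(model)
for name, value in norms.items():
    print(f"{name}: {value:.6f}")
```

Logging these norms every few iterations at two different learning rates would show whether the LSTM weights or the linear layer lose their gradient first.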