I’m trying to implement a modified version of the LAS (Listen, Attend and Spell) model. Training looks fine, and both the loss and the WER are decreasing, but after the second epoch the attention looks wrong: it attends not only to the correct part of the listener features but also to the end of the sequence.
Here is an attention plot: https://i.imgur.com/iCp404F.jpg
I think that at any given decoding step it should attend to one part of the sequence, not two. Splitting the weight between two distant regions is also hard for the softmax.
Does anyone have any idea what could cause this? Are there common structural problems, or anything else I should check?
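One thing I have not ruled out is whether padded frames at the end of the listener output are being excluded before the softmax. To make clear what I mean, here is a minimal sketch in plain Python. It is not my actual model code: `masked_softmax`, `scores`, and `length` are hypothetical names, and the numbers are made up for illustration.

```python
import math

def masked_softmax(scores, length):
    """Softmax over the first `length` entries of `scores`.

    Entries at index >= length are treated as padding and get weight 0.
    (Hypothetical helper for illustration only.)
    """
    if not 0 < length <= len(scores):
        raise ValueError("length must be in 1..len(scores)")
    valid = scores[:length]
    # Subtract the max for numerical stability.
    peak = max(valid)
    exps = [math.exp(s - peak) for s in valid]
    total = sum(exps)
    # Padding positions receive exactly zero attention weight.
    return [e / total for e in exps] + [0.0] * (len(scores) - length)

# Made-up energies for one decoder step: 3 real frames, then 2 padded frames
# whose raw scores happen to be large.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]

unmasked = masked_softmax(scores, len(scores))  # padding competes for weight
masked = masked_softmax(scores, 3)              # padding is ignored
```

Without the mask, the trailing frames take most of the weight in this toy case. With it, all the weight stays on the real frames. I don't know whether this is what is happening in my model, so other suggestions are welcome.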