Encoder-decoder text summarisation

Does anybody have experience with encoder-decoder architectures?
I am doing a project on abstractive text summarisation and I've run into an issue where the decoder always predicts the end-of-sequence token when using greedy selection (the highest-probability token is always the eos token). The problem seems even worse when I add attention.
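For context, here is roughly what my greedy selection looks like (a minimal sketch, not my actual code; `decoder_step`, the token ids, and the PyTorch usage are all assumptions for illustration):

```python
import torch

EOS_ID = 2  # assumed id of the end-of-sequence token

def greedy_decode(decoder_step, hidden, max_len=20):
    """Greedy decoding loop: at each step take the argmax over the vocab.

    decoder_step(token, hidden) -> (logits, hidden) is a stand-in for
    one forward pass of the decoder.
    """
    token = torch.tensor([1])  # assumed <sos> id
    output = []
    for _ in range(max_len):
        logits, hidden = decoder_step(token, hidden)
        token = logits.argmax(dim=-1)  # greedy selection
        if token.item() == EOS_ID:     # in my case this fires immediately
            break
        output.append(token.item())
    return output
```

With my trained model, the argmax at the very first step is already `EOS_ID`, so the loop exits with an empty summary.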

I understand that beam search is generally used to construct the most likely sentence, but should the highest-probability token ALWAYS be the eos token? Is this a common problem with a known cause?