I trained the same network in PyTorch but get much worse results

I have trained a transformer from scratch in PyTorch, many times and in many ways, but it still gives me terrible accuracy, as if the weights were still random.
In Keras, I trained the same network once and it gave me essentially perfect performance, even with a 5-sample dataset. (I tested the model on the same 5 samples in both frameworks, and only Keras gives good results.)
I have checked the architecture many times but haven't figured out the problem yet!
I would really appreciate any help, both for me and for other people with the same problem.

I think it's worth noting that if I give the whole input and the whole output to the model in the inference phase, it works perfectly (I mean, without using teacher forcing it works).
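
To make clear which two inference modes I am comparing, here is a minimal, framework-free sketch. It is not the code from my notebooks: `toy_model`, the `BOS`/`EOS` token ids, and the function names are all hypothetical stand-ins, and the "model" simply copies the source so the example can run on its own. The first function feeds the whole target sequence in a single pass; the second decodes token by token from the model's own previous predictions.

```python
from typing import Callable, List

BOS, EOS = 1, 2  # hypothetical special token ids

# A model maps (source tokens, decoder input tokens) to one predicted
# token per decoder position.
Model = Callable[[List[int], List[int]], List[int]]


def toy_model(src: List[int], tgt_in: List[int]) -> List[int]:
    """Hypothetical stand-in for a trained transformer: copies src, then emits EOS."""
    return [src[i] if i < len(src) else EOS for i in range(len(tgt_in))]


def predict_with_full_target(model: Model, src: List[int], tgt: List[int]) -> List[int]:
    """Feed the whole ground-truth target (shifted right) in one pass."""
    decoder_input = [BOS] + tgt[:-1]
    return model(src, decoder_input)


def greedy_decode(model: Model, src: List[int], max_len: int = 20) -> List[int]:
    """Feed only the model's own previous predictions, one token at a time."""
    decoded = [BOS]
    for _ in range(max_len):
        next_token = model(src, decoded)[-1]  # prediction for the newest position
        decoded.append(next_token)
        if next_token == EOS:
            break
    return decoded[1:]  # drop the BOS token


src = [5, 6, 7]
tgt = [5, 6, 7, EOS]
full_target_output = predict_with_full_target(toy_model, src, tgt)
greedy_output = greedy_decode(toy_model, src)
```

With this toy model both calls return the same sequence. The behaviour I described above is that the PyTorch model works in the first mode, where it is given the whole output.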

The notebooks are here:
Keras: https://colab.research.google.com/drive/1cz5Q-FgThpmM0lAT1mI8AgCRNi9ctsl-?usp=sharing
PyTorch: https://colab.research.google.com/drive/1GlWjfKo9_GYz93rW4ZWyR6VSN45Bec3v?usp=sharing