I have trained a transformer from scratch in PyTorch many times, in many different ways, but it still gives me terrible accuracy, as if its predictions were random.
In Keras, however, I trained the same network once and it gave me perfect performance, even on a 5-sample dataset. (I tested the model on the same 5 samples in both frameworks, and only Keras gives good results.)
I have checked the architecture many times, but I haven't figured out the problem yet!
I would really appreciate any help, both for me and for other people with the same problem.
I think it's worth noting that if I give the whole input and output to the model in the inference phase, it works perfectly (I mean, without using teacher forcing it works).
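To make the two inference modes concrete, here is a minimal, framework-agnostic sketch. It assumes the distinction is between feeding the full target sequence to the decoder and decoding token by token from the model's own predictions. All names (`next_token`, `toy_copy_model`, the `BOS`/`EOS` ids) are hypothetical stand-ins and are not taken from the notebooks.

```python
from typing import Callable, List

BOS, EOS = 1, 2  # hypothetical special-token ids

# next_token(src, prefix) returns the id of the most likely next target token.
NextToken = Callable[[List[int], List[int]], int]


def full_target_predictions(next_token: NextToken, src: List[int], tgt: List[int]) -> List[int]:
    """Predict every target token from the ground-truth prefix that precedes it."""
    return [next_token(src, tgt[:i]) for i in range(1, len(tgt))]


def greedy_decode(next_token: NextToken, src: List[int], max_len: int = 20) -> List[int]:
    """Predict every target token from the model's own previous predictions."""
    out = [BOS]
    while len(out) < max_len:
        tok = next_token(src, out)
        out.append(tok)
        if tok == EOS:
            break
    return out


def toy_copy_model(src: List[int], prefix: List[int]) -> int:
    """Stand-in for a trained model: copies src token by token, then emits EOS."""
    i = len(prefix) - 1  # number of target tokens produced so far (excluding BOS)
    return src[i] if i < len(src) else EOS


src = [5, 6, 7]
tgt = [BOS, 5, 6, 7, EOS]
from_full_target = full_target_predictions(toy_copy_model, src, tgt)
from_own_outputs = greedy_decode(toy_copy_model, src)
```

For this toy model both modes agree: `from_full_target` equals `tgt[1:]` and `from_own_outputs` equals `tgt`. In my PyTorch version, only the first mode gives correct output.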
The notebooks are here: