I'm currently working on a classification transformer using the IMDB sentiment dataset.
I'm pretty unsure whether my model is actually working, because when I remove the transformer blocks the loss is only about 8% higher than with them.
I'm using the pretrained 100-dimensional GloVe word embedding vectors; that's why my hidden dimension is also only 100.
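Roughly like this, in case it matters (a simplified sketch of the idea, not my exact code; the file name, vocab dict, and unknown-token handling are placeholders):

```python
# Sketch of loading pretrained GloVe vectors into an embedding layer.
# The file path and vocab format are assumptions, not the actual setup.
import numpy as np
import torch
import torch.nn as nn

EMBED_DIM = 100  # matches glove.6B.100d

def build_embedding(vocab, glove_path="glove.6B.100d.txt"):
    """Builds an nn.Embedding initialised with GloVe vectors.

    vocab: dict mapping token -> row index.
    Tokens missing from GloVe keep a small random vector.
    """
    matrix = np.random.normal(scale=0.1, size=(len(vocab), EMBED_DIM))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            token, vec = parts[0], parts[1:]
            if token in vocab:
                matrix[vocab[token]] = np.asarray(vec, dtype=np.float32)
    weights = torch.tensor(matrix, dtype=torch.float32)
    # freeze=False lets the vectors be fine-tuned; set True to keep them fixed
    return nn.Embedding.from_pretrained(weights, freeze=False)
```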
From what I've seen, your MultiHeadAttention and EncoderLayer implementations should work. Could you show a continuous representation of both losses (i.e. the loss curves over training), please?
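Something like this is what I mean (a minimal sketch; the loss values here are dummy numbers purely to make the example runnable, not real results):

```python
# Sketch: record the mean training loss each epoch for both variants,
# then plot the two curves together for comparison.
import matplotlib.pyplot as plt

losses_with_blocks = [0.62, 0.48, 0.41, 0.37, 0.35]      # placeholder values
losses_without_blocks = [0.65, 0.55, 0.50, 0.47, 0.46]   # placeholder values

plt.plot(losses_with_blocks, label="with transformer blocks")
plt.plot(losses_without_blocks, label="without transformer blocks")
plt.xlabel("epoch")
plt.ylabel("mean train loss")
plt.legend()
plt.title("Loss curves for both model variants")
plt.show()
```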
For that I need to train the model again. I can send you the losses later.
But I forgot to mention that the model with the transformer blocks seems to overfit, because the validation accuracy was around 80% in each of the 5 epochs. I think that's because I also used the fairly small dimension of the pretrained GloVe word vectors (100) as hid_dim.
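Maybe I should add a simple early-stopping check on the validation accuracy; a rough sketch of what I have in mind (the patience value and the accuracy numbers are just illustrative):

```python
# Sketch of a simple early-stopping check on validation accuracy.
def should_stop(val_accs, patience=2):
    """Stop if validation accuracy hasn't improved for `patience` epochs."""
    if len(val_accs) <= patience:
        return False
    best = max(val_accs[:-patience])
    return all(acc <= best for acc in val_accs[-patience:])

# Example: validation accuracy plateauing around 80% like I described
print(should_stop([0.78, 0.80, 0.80, 0.79, 0.80], patience=2))  # True
```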
I've just seen that you only stack two encoder layers, which might be too few. Regarding your hidden dimension: shouldn't it be hid_dim = embed_dim, since using different dimensions would complicate things? You could also try the 300-dimensional GloVe vectors, for instance.
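If you did want embed_dim and hid_dim to differ, though, the usual trick is a single linear projection after the embedding. Here is a rough sketch of the idea; I'm using nn.TransformerEncoderLayer as a stand-in for your EncoderLayer, positional encoding is omitted, and all hyperparameter values are illustrative assumptions, not recommendations:

```python
# Sketch: a linear projection decouples the 100-d GloVe embeddings from hid_dim,
# so hid_dim does not have to equal embed_dim.
import torch
import torch.nn as nn

class SketchClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hid_dim=256,
                 n_layers=4, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, hid_dim)   # bridges the two dims
        # stand-in for your EncoderLayer (positional encoding omitted for brevity)
        layer = nn.TransformerEncoderLayer(d_model=hid_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(hid_dim, n_classes)

    def forward(self, tokens):                      # tokens: [batch, seq_len]
        x = self.proj(self.embedding(tokens))       # [batch, seq_len, hid_dim]
        x = self.encoder(x)
        return self.fc(x.mean(dim=1))               # mean-pool over the sequence

# dummy usage with made-up vocab size and batch shape
model = SketchClassifier(vocab_size=25000)
logits = model(torch.randint(0, 25000, (8, 128)))   # [8, 2]
```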