I wrote the following code to solve a Seq2Seq regression problem. My implementation is based on the GRU and multi-head attention. The performance is horrible. I tried playing with the hyperparameters, but nothing changed. This led me to think it was a network architecture issue.
I don’t have a real answer, just some food for thought:
I’m not sure how intuitive it is to use nn.MultiheadAttention on the output of an nn.GRU. nn.MultiheadAttention basically implements self-attention, which generally assumes that the sequence elements are “independent”, like word vectors. However, the output of an nn.GRU is different: the output at step T already captures, to some extent, the outputs of all previous steps 1..T-1.
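To make that concrete, here is a minimal sketch (all dimensions are made up) of attention stacked on a GRU: each `h[:, t]` already summarizes the steps up to `t`, so the attention is not mixing “independent” elements the way it would with word vectors.

```python
import torch
import torch.nn as nn

batch, seq_len, d_in, d_model, n_heads = 8, 20, 32, 64, 4  # hypothetical sizes

gru = nn.GRU(input_size=d_in, hidden_size=d_model, batch_first=True)
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(batch, seq_len, d_in)
h, _ = gru(x)                 # h[:, t] already encodes steps 0..t (recurrent summary)
out, weights = attn(h, h, h)  # self-attention over the recurrent states
print(out.shape)              # torch.Size([8, 20, 64])
```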
At least from my basic experience, transformers are difficult to train from scratch; usually you use pretrained models.
Strictly speaking, your model does not implement a Seq2Seq task but a sequence labeling task, i.e., you get an output for each input word/item. I actually can’t see what kind of regression problem you’re trying to solve.
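A quick shape sketch of the distinction (sizes are hypothetical): a sequence labeling model ties the output length to the input length, while a Seq2Seq model uses a separate decoder whose output length is decoupled from the input’s.

```python
import torch
import torch.nn as nn

batch, t_in, t_out, d = 8, 20, 5, 32  # hypothetical sizes

enc = nn.GRU(d, 64, batch_first=True)
head = nn.Linear(64, 1)

# Sequence labeling: one output per input step.
h, h_n = enc(torch.randn(batch, t_in, d))
y_label = head(h)                                    # (8, 20, 1)

# Seq2Seq: a decoder runs for t_out steps, seeded with the encoder's
# final hidden state; output length differs from input length.
dec = nn.GRU(d, 64, batch_first=True)
dec_out, _ = dec(torch.zeros(batch, t_out, d), h_n)  # (8, 5, 64)
y_seq = head(dec_out)                                # (8, 5, 1)
```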
Have you tried a more basic model, just using the nn.GRU? How do the results compare? It’s often better to first try a simple architecture and then extend it.
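For what it’s worth, such a GRU-only baseline could be as small as this (names, sizes, and the dummy batch are all made up); if it performs on par with the attention variant, the extra layer isn’t helping.

```python
import torch
import torch.nn as nn

class GRUBaseline(nn.Module):
    """Plain GRU with a per-step linear regression head."""
    def __init__(self, d_in=32, d_hidden=64):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, x):        # x: (batch, seq_len, d_in)
        h, _ = self.gru(x)
        return self.head(h)      # (batch, seq_len, 1)

model = GRUBaseline()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 20, 32), torch.randn(8, 20, 1)  # dummy batch
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```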