Hello,

I wrote the following code to solve a Seq2Seq regression problem. The implementation is based on GRUs and multi-head attention. Performance is very poor, and tuning the hyperparameters made no difference, which leads me to think the problem lies in the network architecture.

```
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    def __init__(self, input_size, output_size, hidden, num_heads):
        super().__init__()
        # Both GRUs use 2 layers and the default batch_first=False,
        # so inputs are read as (seq_len, batch, features).
        self.encoder = nn.GRU(input_size, hidden, 2)
        self.decoder = nn.GRU(hidden, hidden, 2)
        self.multihead_attn = nn.MultiheadAttention(hidden, num_heads)
        self.linear = nn.Linear(hidden, output_size)
        self.init_weights()

    def init_weights(self):
        # Only the output layer's weights are re-initialized.
        self.linear.weight.data.normal_(0, 0.1)

    def forward(self, x):
        encoded, _ = self.encoder(x)
        # The decoder consumes the full encoder output sequence.
        decoded, _ = self.decoder(encoded)
        # Self-attention over the decoder outputs (query = key = value).
        attention_output, _ = self.multihead_attn(decoded, decoded, decoded)
        out = self.linear(attention_output)
        return out


D_in = 4
D_out = 1
hidden = 16
num_heads = 4
seq2seq = Seq2Seq(input_size=D_in, output_size=D_out, hidden=hidden, num_heads=num_heads)
inputs = torch.rand((7, 100, D_in))  # read as (seq_len=7, batch=100, D_in) under the defaults
outputs = seq2seq(inputs)            # shape: (7, 100, D_out)
```
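
To help narrow things down, here is a minimal shape check. It assumes the default `batch_first=False` layout of `nn.GRU` and `nn.MultiheadAttention`, under which a `(7, 100, D_in)` tensor is read as 7 time steps for a batch of 100. The variable names below are illustrative and not part of the model above.

```python
import torch
import torch.nn as nn

# Illustrative sizes matching the example input above.
seq_len, batch, d_in, hidden, num_heads = 7, 100, 4, 16, 4

gru = nn.GRU(d_in, hidden, num_layers=2)         # batch_first=False by default
attn = nn.MultiheadAttention(hidden, num_heads)  # batch_first=False by default

x = torch.rand(seq_len, batch, d_in)
gru_out, h_n = gru(x)
attn_out, attn_weights = attn(gru_out, gru_out, gru_out)

print(gru_out.shape)       # (seq_len, batch, hidden)
print(h_n.shape)           # (num_layers, batch, hidden)
print(attn_out.shape)      # (seq_len, batch, hidden)
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads
```

If the data is actually laid out as `(batch, seq_len, features)`, then attention here would be computed across the batch dimension rather than across time, and passing `batch_first=True` to both layers may be worth checking.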

Any suggestions are highly appreciated.