How to design a decoder for time series regression with a Transformer?

I am using a Transformer for time series regression (not forecasting). My input has the shape [batch, seq_size, embedding_dim], and my output has the shape [batch, seq_size, 1]. But the model always overfits and performs worse than an LSTM. I don't want to include the target information in the decoder. Can anyone tell me how to design the decoder?

import torch.nn as nn

class Transformer(nn.Module):

    def __init__(self, d_input, n_head, d_model, n_layer):
        super().__init__()

        # Custom input embedding (projection to d_model plus positional
        # encoding); data_position_embedding is defined elsewhere.
        self.enc_embedding = self.data_position_embedding(c_in=d_input, d_model=d_model)

        # batch_first=True so the encoder accepts [batch, seq_size, d_model]
        # directly (the default layout is [seq_size, batch, d_model]).
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_head, batch_first=True)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=n_layer)

        # Per-timestep regression head
        self.project = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: [batch, seq_size, embedding_dim]
        x_emb = self.enc_embedding(x)   # [batch, seq_size, d_model]
        enc = self.encoder(x_emb)       # [batch, seq_size, d_model]

        # No real decoder here: the encoder output goes straight to the head
        dec = enc

        output = self.project(dec)      # [batch, seq_size, 1]
        return output

Structure: (architecture diagram attached)

Loss curve: (plot attached)

Hello,

I have a similar scenario. Did you manage to solve the issue?

I have the exact same problem. Did you find out how this issue should be addressed? Thank you.


I just came across this post.
Why are you using both an encoder and a decoder? If you're not feeding the decoder its own input (the "Output Probabilities", or in your case the output of the preceding Linear layer), I don't think you need the decoder at all, and you aren't using cross-attention. In your diagram the "Output embedding" block is crossed out, so it looks like nothing is being fed into the Masked Multi-Head Attention block.

You should try using the encoder only. Instead of feeding the encoder's final Add & Norm output into a decoder, just feed it into a Linear layer for your output (and then, possibly, a ReLU or tanh layer, depending on how your data is normalized).
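For example, here is a minimal sketch of that encoder-only setup (the sinusoidal positional encoding, the EncoderOnlyRegressor name, and the numbers in the shape check are illustrative placeholders, not taken from your post):

import math
import torch
import torch.nn as nn

class EncoderOnlyRegressor(nn.Module):
    """Encoder-only Transformer for per-timestep regression (no decoder)."""

    def __init__(self, d_input, d_model, n_head, n_layer, max_len=5000):
        super().__init__()
        # Project the raw input features to the model dimension
        self.input_proj = nn.Linear(d_input, d_model)

        # Fixed sinusoidal positional encoding (one common choice; a learned
        # positional embedding works as well)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))  # [1, max_len, d_model]

        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_head,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layer)

        # Regression head applied to every timestep; append nn.Tanh() after it
        # if your targets are normalized to [-1, 1]
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                 # x: [batch, seq, d_input]
        h = self.input_proj(x) + self.pe[:, :x.size(1)]
        h = self.encoder(h)               # [batch, seq, d_model]
        return self.head(h)               # [batch, seq, 1]

# Quick shape check
model = EncoderOnlyRegressor(d_input=16, d_model=64, n_head=4, n_layer=2)
out = model(torch.randn(8, 50, 16))
print(out.shape)                          # torch.Size([8, 50, 1])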
