Problem Understanding Attention for Time Series

Hey everyone,

I am working on a private dataset to forecast patient visits at the Emergency Department.

I tried to create an Encoder-Decoder model based on LSTM (Seq2Seq) with General Attention. However, I am facing some issues running the code and have a few questions because I am not used to working with sequences.

  1. My dataset has 10 columns, one of which is the target variable.
     • Should my X tensor be of shape (batch_size, sequence_length, 9) or (batch_size, sequence_length, 10) including the target?
  2. Should my y tensor be of shape (batch_size, forecasting_horizon) or (batch_size, seq, forecasting_horizon)?
  3. I can’t share the dataset, but here is some code for context:

import random

import torch
import torch.nn as nn
import torch.optim as optim


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, attention, teacher_ratio, bidirectional=1, device="cpu"):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.attention = attention
        self.fc = nn.Linear(self.encoder.hidden_size*bidirectional, 1)
        self.device = device
        self.teacher_ratio = teacher_ratio
    
    def _get_top_layer_hidden_state(self, hidden_state):
        hidden_state, _ = hidden_state
        return hidden_state[-1, :, :]

    
    def forward(self, batch):
        x, y = batch
        x = x.to(self.device)

        encoded_outputs, hidden_state = self.encoder(x)
        y = y.unsqueeze(2)
        y_hat = torch.zeros_like(y, device=y.device)
        dec_input = x[:, -1:, :] 

        for i in range(y.size(1)):
            top_hidden_state = self._get_top_layer_hidden_state(hidden_state)
            context = self.attention(top_hidden_state.unsqueeze(1), encoded_outputs)
            dec_input = torch.cat((dec_input, context.unsqueeze(1)), dim=-1)

            output, hidden_state = self.decoder(dec_input, hidden_state)
            output = self.fc(output)

            y_hat[:, i, :] = output.squeeze(1)

            # Improved teacher forcing implementation:
            teacher_force = random.random() < self.teacher_ratio
            if teacher_force:
                # Use the ground truth token for better training stability
                dec_input = y[:, i, :].unsqueeze(1)
            else:
                # Use the model's prediction as the next decoder input
                dec_input = output
            dec_input = dec_input.to(x.device)
        return y_hat, y

model = Seq2Seq(
    encoder=nn.LSTM(input_size, hidden_size, num_layers=num_layers, dropout=0., batch_first=True),
    decoder=nn.LSTM(input_size+hidden_size, hidden_size, num_layers=num_layers, dropout=0., batch_first=True),
    attention=GeneralAttention(encoder_dim=hidden_size, decoder_dim=hidden_size),
    teacher_ratio=0.3,
    bidirectional=1,
    device=device,
).to(device)

loss_fn = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

train_loader = train_dataset.tensor_loader.loader
val_loader = val_dataset.tensor_loader.loader

best_val_loss = float('inf')
current_patience = 0
best_model_state_dict = None

# Training loop
for epoch in range(num_epochs):
    model.train()
    training_losses = []
    for batch in train_loader:
        optimizer.zero_grad()
        y_hat, y = model(batch)
        loss = loss_fn(y_hat, y)
        loss.backward()
        optimizer.step()
        training_losses.append(loss.item())
        ....
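
On questions 1 and 2, the shapes that the forward method above appears to expect can be made concrete with a small windowing sketch. It assumes the target column stays in X (so past target values are visible to the encoder) and that y holds only the future target values, which matches the y.unsqueeze(2) call and the 10-feature input mentioned later. The helper make_windows, the column index target_col=0, and the synthetic data are hypothetical and not part of my real pipeline.

```python
import torch


def make_windows(series, seq_len, horizon, target_col):
    """Slice a (time, features) tensor into encoder inputs and forecast targets."""
    xs, ys = [], []
    last_start = series.size(0) - seq_len - horizon
    for start in range(last_start + 1):
        end = start + seq_len
        xs.append(series[start:end])                      # (seq_len, n_features)
        ys.append(series[end:end + horizon, target_col])  # (horizon,)
    return torch.stack(xs), torch.stack(ys)


# Synthetic stand-in for the private dataset: 100 time steps, 10 columns.
series = torch.randn(100, 10)
X, y = make_windows(series, seq_len=24, horizon=7, target_col=0)
print(X.shape)  # torch.Size([70, 24, 10])
print(y.shape)  # torch.Size([70, 7])
```

With this layout, X is (batch_size, sequence_length, 10) and y is (batch_size, forecasting_horizon). Whether the target belongs in X depends on whether its past values are known at forecast time.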

When running the forward method in the first epoch, the loop body executes without error for i=0, but it fails at i=1.

As you can see, dec_input changes shape:


teacher_force = random.random() < self.teacher_ratio
if teacher_force:
    # Use the ground truth token for better training stability
    dec_input = y[:, i, :].unsqueeze(1)
else:
    # Use the model's prediction as the next decoder input
    dec_input = output

When i=0, dec_input is fed into self.decoder(...) with shape (64, 1, 290), where 64 is my batch size, 1 is the single time step taken from the end of the encoder input (x[:, -1:, :]), and 290 is the hidden size (280) plus the input feature size (10).

However, after the teacher-forcing step, dec_input has shape (64, 1, 1). Concatenating it with the context at i=1 gives a dec_input of (64, 1, 281), which causes an error when trying to decode:

RuntimeError: input.size(-1) must be equal to input_size. Expected 290, got 281

dec_input = torch.cat((dec_input, context.unsqueeze(1)), dim=-1)

output, hidden_state = self.decoder(dec_input, hidden_state)
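
The two concatenations can be reproduced in isolation, without the model. This is only a shape check using zero tensors with the sizes quoted above (batch 64, 10 features, hidden size 280); the variable names are illustrative.

```python
import torch

batch_size, n_features, hidden_size = 64, 10, 280

context = torch.zeros(batch_size, 1, hidden_size)     # attention context for one step
first_input = torch.zeros(batch_size, 1, n_features)  # x[:, -1:, :] used at i=0
next_input = torch.zeros(batch_size, 1, 1)            # y[:, i, :].unsqueeze(1) or output, used at i>=1

step0 = torch.cat((first_input, context), dim=-1)
step1 = torch.cat((next_input, context), dim=-1)
print(step0.shape)  # torch.Size([64, 1, 290])
print(step1.shape)  # torch.Size([64, 1, 281])
```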

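This occurs because my decoder is constructed with input_size = encoder input size + hidden_size (10 + 280 = 290), while from the second step onward it only receives one value plus the context. I think I am missing something about how a Seq2Seq decoder is supposed to be fed at each step.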
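
One possible way to keep the decoder input consistent, sketched under assumptions: feed the decoder a single target value plus the context at every step, so its input_size is 1 + hidden_size, and start from the last observed target value instead of the full last row of x. This requires the target column to be part of x. The names Seq2SeqSketch and target_col are hypothetical, and the GeneralAttention below is a stand-in using a Luong-style "general" score, since my actual attention class is not shown above.

```python
import random

import torch
import torch.nn as nn


class GeneralAttention(nn.Module):
    """Hypothetical stand-in: Luong-style 'general' score h_dec^T W h_enc."""

    def __init__(self, encoder_dim, decoder_dim):
        super().__init__()
        self.W = nn.Linear(encoder_dim, decoder_dim, bias=False)

    def forward(self, query, encoder_outputs):
        # query: (batch, 1, decoder_dim); encoder_outputs: (batch, seq_len, encoder_dim)
        scores = torch.bmm(query, self.W(encoder_outputs).transpose(1, 2))  # (batch, 1, seq_len)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights, encoder_outputs)  # (batch, 1, encoder_dim)


class Seq2SeqSketch(nn.Module):
    def __init__(self, n_features, hidden_size, num_layers, target_col=0, teacher_ratio=0.3):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        # The decoder sees one target value plus the context vector at every step.
        self.decoder = nn.LSTM(1 + hidden_size, hidden_size, num_layers, batch_first=True)
        self.attention = GeneralAttention(hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size, 1)
        self.target_col = target_col
        self.teacher_ratio = teacher_ratio

    def forward(self, x, y=None, horizon=None):
        # x: (batch, seq_len, n_features); y: (batch, horizon), or None at inference
        steps = y.size(1) if y is not None else horizon
        encoded_outputs, hidden_state = self.encoder(x)
        # Start from the last observed target value: (batch, 1, 1)
        prev_value = x[:, -1:, self.target_col:self.target_col + 1]
        preds = []
        for i in range(steps):
            query = hidden_state[0][-1].unsqueeze(1)              # top-layer h: (batch, 1, hidden)
            context = self.attention(query, encoded_outputs)      # (batch, 1, hidden)
            dec_input = torch.cat((prev_value, context), dim=-1)  # (batch, 1, 1 + hidden)
            output, hidden_state = self.decoder(dec_input, hidden_state)
            output = self.fc(output)                              # (batch, 1, 1)
            preds.append(output)
            # Teacher forcing only during training and only when targets are given.
            use_truth = y is not None and self.training and random.random() < self.teacher_ratio
            prev_value = y[:, i].reshape(-1, 1, 1) if use_truth else output
        return torch.cat(preds, dim=1).squeeze(-1)                # (batch, horizon)


model = Seq2SeqSketch(n_features=10, hidden_size=280, num_layers=2)
x = torch.randn(64, 24, 10)
y = torch.randn(64, 7)
y_hat = model(x, y)
print(y_hat.shape)  # torch.Size([64, 7])
```

Here the concatenated decoder input is rebuilt from prev_value on each iteration, so its last dimension stays at 1 + hidden_size whether teacher forcing is used or not. Other designs are possible, for example feeding known future covariates alongside the previous value, which would change the decoder input_size accordingly.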

Thanks for your help!

Bumping this: I still need an answer, if anyone can help.

Thanks a lot!