Applying Attention and MLP in a Transformer-Based Prediction Model

I have a question about building a Transformer encoder-based model. The application I am working on involves predicting the stress response of an object when strain is applied to it. My data consists of strain inputs of shape (6200, 25, 4) and the corresponding stress outputs of shape (6200, 25, 4), both laid out as (samples, sequence length, features).

For example, I want to predict the stress at step 13 when I input the entire sequence of strain from step 1 to step 13. At the maximum, I want to predict the stress at step 25 when I input the strain from step 1 to step 25. Of course, I will use a mask to exclude attention from future steps.
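A minimal sketch of such a causal mask, assuming the additive float convention that `nn.TransformerEncoder` accepts (0.0 where attention is allowed, `-inf` where it is blocked). The helper name `causal_mask` is hypothetical and used only for illustration.

```python
import torch

def causal_mask(sz: int) -> torch.Tensor:
    """Additive attention mask: 0.0 for allowed positions, -inf for future steps."""
    # Entries strictly above the diagonal (column > row) are future steps.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

mask = causal_mask(4)
print(mask)  # row i is the query step, column j is the key step
```

Row i of the mask lets step i attend to steps 0 through i and nothing later.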

In this model, if we look at the flow of data shapes, the input data first passes through a linear layer, transforming from (batch, sequence, features) to (batch, sequence, d_model), and then goes into the encoder. The final output of the encoder will similarly be in the shape of (batch, sequence, d_model).
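The shape flow described above can be checked with a small standalone sketch. The sizes `d_model=32`, `nhead=4`, `dim_feedforward=64`, and `num_layers=2` are assumed values chosen only for illustration, and `batch_first=True` is used so the tensors stay in (batch, sequence, features) order.

```python
import torch
import torch.nn as nn

batch, seq_len, n_features, d_model = 8, 25, 4, 32  # d_model is an assumed value

embed = nn.Linear(n_features, d_model)  # (batch, seq, features) -> (batch, seq, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64,
                                   dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

strain = torch.randn(batch, seq_len, n_features)  # random stand-in for strain data
x = embed(strain)

# Causal mask so that each step only attends to itself and earlier steps.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
h = encoder(x, mask=mask)

print(x.shape, h.shape)  # both are (batch, seq_len, d_model)
```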

Next, I want to output the stress at the current step through an MLP. Is it correct to take only the last time step of the encoder output and use it as the input to the MLP? That would give a prediction for the current step as (batch, d_model) -> (batch, features).
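As a sketch of that readout step: with a causal mask, the last position is the only one that has attended to every step of the input, so slicing it out and passing it to the MLP gives one prediction per sample. The encoder output below is a random stand-in, and the sizes are assumed values.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, n_features = 8, 13, 32, 4  # assumed sizes

# Random stand-in for the encoder output, used only to show the shapes.
enc_out = torch.randn(batch, seq_len, d_model)

head = nn.Sequential(
    nn.Linear(d_model, 128),
    nn.ReLU(),
    nn.Linear(128, n_features),
)

last_hidden = enc_out[:, -1, :]  # (batch, d_model): hidden state of the current step
stress_now = head(last_hidden)   # (batch, features): stress prediction for that step
print(last_hidden.shape, stress_now.shape)
```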

The model code is as follows:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
import math

# Model Definition

class TransformerModel(nn.Module):
    def __init__(self, input_size, output_size, d_model, nhead, num_layers, dim_feedforward, dropout=0):
        super(TransformerModel, self).__init__()

        self.pos_encoder = PositionalEncoding(d_model, dropout)

        encoder_layers = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)

        # Projects the input features to d_model
        self.encoder = nn.Linear(input_size, d_model)

        # MLP head: d_model -> output features
        self.decoder = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Linear(128, output_size)
        )
        self.d_model = d_model

    def generate_mask(self, sz):
        # Causal mask: 0.0 on and below the diagonal, -inf above it (future steps)
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, src, seq_len):
        # src: (batch, seq_len, input_size)
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        mask = self.generate_mask(seq_len).to(src.device)
        # The encoder layers expect (seq_len, batch, d_model), so transpose in and out
        output = self.transformer_encoder(src.transpose(0, 1), mask).transpose(0, 1)
        # Use only the last time step: (batch, d_model) -> (batch, output_size)
        current_step_output = self.decoder(output[:, -1, :])

        return current_step_output

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0, max_len=500):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape (1, max_len, d_model) so it broadcasts over the batch dimension
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model), so slice the encoding by sequence length, not batch size
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

Instead of the method mentioned earlier, would it be acceptable to apply attention only to the strain data (input data) and then use an MLP to transform from (batch, sequence, d_model) to (batch, sequence, features)?
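A minimal sketch of that per-step variant, under stated assumptions: the class name `PerStepModel` and all hyperparameters are hypothetical, the data is random, and positional encoding is left out for brevity. Because `nn.Linear` acts on the last dimension, the same MLP head is applied to every time step, and the causal mask keeps each step's prediction from seeing future strain.

```python
import torch
import torch.nn as nn

class PerStepModel(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_features=4, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=64,
                                           dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, n_features))

    def forward(self, strain):
        # strain: (batch, seq_len, n_features)
        seq_len = strain.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=strain.device), diagonal=1
        )
        h = self.encoder(self.embed(strain), mask=mask)  # (batch, seq_len, d_model)
        return self.head(h)                              # (batch, seq_len, n_features)

model = PerStepModel()
strain = torch.randn(8, 25, 4)  # random stand-ins for strain and stress
stress = torch.randn(8, 25, 4)

pred = model(strain)
loss = nn.functional.mse_loss(pred, stress)  # averaged over all 25 steps at once
loss.backward()
print(pred.shape, loss.item())
```

With this layout, a single forward and backward pass produces a training signal for every step of the sequence.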

The downside of the previous method is that it requires training on every prefix of each sequence separately. For example, training is needed for steps 1~1, 1~2, 1~3…1~25, which can be time-consuming.
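Whether the separate prefixes are really needed can be checked numerically. The sketch below, with assumed sizes and random data, compares the encoder output at step 13 of the full 25-step sequence against the last-step output for the truncated input of steps 1 to 13. With a causal mask and dropout disabled, the two should agree up to floating-point error, since step 13 never attends to later steps in either case.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_features = 32, 4  # assumed sizes

embed = nn.Linear(n_features, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=64,
                                   dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()

def encode(x):
    """Run the causally masked encoder on x of shape (batch, seq_len, n_features)."""
    n = x.size(1)
    mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
    return encoder(embed(x), mask=mask)

strain = torch.randn(2, 25, n_features)  # random stand-in for strain data

full = encode(strain)             # all 25 steps in one pass
prefix = encode(strain[:, :13])   # only steps 1..13

# Step 13 (index 12) of the full pass vs. the last step of the prefix pass.
same = torch.allclose(full[:, 12], prefix[:, -1], atol=1e-4)
print(same)
```

The same comparison holds with absolute positional encodings, because step 13 receives the same encoding whether or not later steps are present.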