Unable to achieve better performance with a transformer than with an LSTM

I am trying to train an ML model on time series data. The input is 10 time series, essentially sensor data. The output is another set of three time series. I feed the model windows of 100 time steps, so the input shape is (100, 10). I want to predict the output time series values for a single time step, so the output shape is (1, 3). (With mini-batches of size x, the input and output shapes become (x, 100, 10) and (x, 1, 3).)
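
For reference, the windowing is roughly like this (a simplified sketch; the helper and array names are just illustrative, not my actual preprocessing code):

import numpy as np
import torch

def make_windows(inputs, targets, window=100):
    # inputs:  (T, 10) array of sensor series
    # targets: (T, 3) array of output series
    # returns X of shape (N, 100, 10) and Y of shape (N, 1, 3), where each
    # Y[i] is the single target step being predicted for the i-th window
    X, Y = [], []
    for t in range(window, len(inputs)):
        X.append(inputs[t - window:t])
        Y.append(targets[t:t + 1])
    X = torch.tensor(np.stack(X), dtype=torch.float32)
    Y = torch.tensor(np.stack(Y), dtype=torch.float32)
    return X, Y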

My approach is to first overfit the model on a smaller number of records to check that the model is indeed learning / able to overfit the data, then add some regularization (mostly dropout), and then train the model on the full dataset.

First, I tried to overfit an LSTM model (4 LSTM layers with 1400 hidden units each) on a small dataset and visualised the outcome. It did well, so I trained it on the whole dataset. It did okay-ish, but still struggled in some places.
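
For reference, the LSTM baseline is roughly this (a simplified sketch, not my exact code):

import torch.nn as nn

class LSTMBaseline(nn.Module):
    # rough sketch of the baseline: 4 stacked LSTM layers with
    # 1400 hidden units each, then a linear head to the 3 targets
    def __init__(self, n_inputs=10, hidden=1400, n_targets=3):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, num_layers=4, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):                     # x: (batch, 100, 10)
        out, _ = self.lstm(x)                 # out: (batch, 100, 1400)
        return self.head(out[:, -1:, :])      # (batch, 1, 3)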

I tried adding dropout too, but it did not yield any significant improvement. So I tried to train a PatchTST transformer model. First, I overfitted a smaller model, and it did well. In fact, when I visualised the output, I realised it achieved a tighter overfit than the LSTM model. So I tried to train it on the whole dataset.

The initial version of PatchTST I tried is as follows:

from transformers import PatchTSTConfig, PatchTSTForRegression

config = PatchTSTConfig(
    num_input_channels=10,   # 10 input sensor series
    context_length=100,      # window of 100 time steps
    num_targets=3,           # 3 output series
    patch_length=10,
    patch_stride=10,
    prediction_length=1,
    num_hidden_layers=4,
    num_attention_heads=8,
    d_model=600,
)

model = PatchTSTForRegression(config)

# Trainer body (one training step) looks like this:
patch_tst_y_hat = self.model(X)                 # forward pass through PatchTSTForRegression
y_hat = patch_tst_y_hat.regression_outputs      # extract the regression predictions
# for the LSTM it was simply:
# y_hat = self.model(X)
loss = self.loss_fn(y_hat, Y)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

With this as the base config, I tried different changes for hyperparameter optimization (a sketch of how the variants are built follows the list):

  1. num_hidden_layers = 7
  2. d_model = 1000
  3. d_model = 1000, num_attention_heads = 4
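
Each variant is just the base config with the corresponding fields overridden, along these lines (illustrative only):

from transformers import PatchTSTConfig

base_kwargs = dict(
    num_input_channels=10, context_length=100, num_targets=3,
    patch_length=10, patch_stride=10, prediction_length=1,
    num_hidden_layers=4, num_attention_heads=8, d_model=600,
)

variant_overrides = [
    dict(num_hidden_layers=7),
    dict(d_model=1000),
    dict(d_model=1000, num_attention_heads=4),
]
configs = [PatchTSTConfig(**{**base_kwargs, **ov}) for ov in variant_overrides]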

The validation loss curves are as follows:

[image: validation loss curves for the different configurations]

The training loss curves also look the same. (The forum is allowing me to insert only one image.)

Can someone suggest ideas for beating the LSTM model's performance with PatchTST, or some other transformer-related changes?

PS: I have also tried an LSTM with attention. It did yield a tiny improvement, but not enough. I am open to non-transformer suggestions too, but I would love to learn how to achieve better performance with a transformer.
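
For reference, the "LSTM with attention" variant was roughly along these lines (a simplified sketch with self-attention over the LSTM outputs, not my exact code):

import torch.nn as nn

class LSTMWithAttention(nn.Module):
    # rough sketch: LSTM encoder, multi-head self-attention over its
    # outputs, then a linear head on the last time step
    def __init__(self, n_inputs=10, hidden=1400, n_targets=3, n_heads=8):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, num_layers=4, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):                        # x: (batch, 100, 10)
        out, _ = self.lstm(x)                    # (batch, 100, 1400)
        attn_out, _ = self.attn(out, out, out)   # (batch, 100, 1400)
        return self.head(attn_out[:, -1:, :])    # (batch, 1, 3)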