Transformer erratic training loss when given more training data

Hello, I have a model that classifies single-particle trajectories (according to which algorithm was used to generate them).

My current model has 2 convolutional layers, a 3-layer bidirectional LSTM, and some linear layers for the output.

Here is the structure of my convolutional LSTM:

ConejeroConvNet(
  (ConvBlock): Sequential(
    (0): Conv1d(1, 20, kernel_size=(3,), stride=(1,))
    (1): ReLU()
    (2): Conv1d(20, 64, kernel_size=(3,), stride=(1,))
    (3): ReLU()
    (4): Dropout(p=0.2, inplace=False)
    (5): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (bi_lstm): LSTM(498, 32, num_layers=3, batch_first=True)
  (linearOuts): Sequential(
    (0): Linear(in_features=4096, out_features=1000, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1000, out_features=50, bias=True)
    (3): ReLU()
    (4): Linear(in_features=50, out_features=5, bias=True)
  )
)
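To make the printed shapes concrete, here is a rough, runnable reconstruction of this structure. Note this is my reading of the printout, not the exact original code: the input length of 1000 is an assumption chosen so the conv output length comes out to 498, and with `batch_first=True` the 64 conv channels act as the LSTM's time axis (64 steps × 64 bidirectional features = 4096 going into the linear layers).

```python
import torch
import torch.nn as nn

class ConvLSTMSketch(nn.Module):
    # Rough reconstruction of the printed ConejeroConvNet structure.
    # Assumed input: (batch, 1, 1000) -> conv3 -> 998 -> conv3 -> 996
    # -> maxpool2 -> 498, so the LSTM sees input_size=498.
    def __init__(self, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size=3), nn.ReLU(),
            nn.Conv1d(20, 64, kernel_size=3), nn.ReLU(),
            nn.Dropout(0.2), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(498, 32, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.out = nn.Sequential(
            nn.Linear(4096, 1000), nn.ReLU(),
            nn.Linear(1000, 50), nn.ReLU(),
            nn.Linear(50, n_classes),
        )

    def forward(self, x):              # x: (batch, 1, 1000)
        x = self.conv(x)               # (batch, 64, 498): channels become the "time" axis
        x, _ = self.lstm(x)            # (batch, 64, 64): 64 steps x (2 * 32) hidden
        return self.out(x.flatten(1))  # (batch, 4096) -> (batch, 5)

m = ConvLSTMSketch()
y = m(torch.randn(2, 1, 1000))
```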

I replaced the LSTM with a classifier transformer adapted from the one here: http://peterbloem.nl/blog/transformers

Here is the structure of my transformer:

convTransformer(
  (ConvBlock): Sequential(
    (0): Conv1d(1, 20, kernel_size=(3,), stride=(1,))
    (1): ReLU()
    (2): Conv1d(20, 64, kernel_size=(3,), stride=(1,))
    (3): ReLU()
    (4): Dropout(p=0.2, inplace=False)
    (5): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (cTrans): ClassifierTransformer(
    (transBlocks): Sequential(
      (0): TransformerBlock(
        (attention): SelfAttentionNarrow(
          (toKeys): Linear(in_features=8, out_features=8, bias=False)
          (toQueries): Linear(in_features=8, out_features=8, bias=False)
          (toValues): Linear(in_features=8, out_features=8, bias=False)
          (unifyHeads): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (ff): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
        )
        (dropOut): Dropout(p=0.2, inplace=False)
      )
      (1): TransformerBlock(
        (attention): SelfAttentionNarrow(
          (toKeys): Linear(in_features=8, out_features=8, bias=False)
          (toQueries): Linear(in_features=8, out_features=8, bias=False)
          (toValues): Linear(in_features=8, out_features=8, bias=False)
          (unifyHeads): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (ff): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
        )
        (dropOut): Dropout(p=0.2, inplace=False)
      )
      (2): TransformerBlock(
        (attention): SelfAttentionNarrow(
          (toKeys): Linear(in_features=8, out_features=8, bias=False)
          (toQueries): Linear(in_features=8, out_features=8, bias=False)
          (toValues): Linear(in_features=8, out_features=8, bias=False)
          (unifyHeads): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (ff): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
        )
        (dropOut): Dropout(p=0.2, inplace=False)
      )
..............[more transformer blocks].......................
      (9): TransformerBlock(
        (attention): SelfAttentionNarrow(
          (toKeys): Linear(in_features=8, out_features=8, bias=False)
          (toQueries): Linear(in_features=8, out_features=8, bias=False)
          (toValues): Linear(in_features=8, out_features=8, bias=False)
          (unifyHeads): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (ff): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
        )
        (dropOut): Dropout(p=0.2, inplace=False)
      )
    )
    (dropout): Dropout(p=0.2, inplace=False)
    (linOut): Linear(in_features=64, out_features=5, bias=True)
  )
)
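For readers who don't want to follow the link, one of these blocks can be sketched roughly as follows. This is a simplification, not the blog's exact code: I use a single wide QKV projection rather than the per-head 8-to-8 "narrow" linears, but the head layout (8 heads of size 8 on a 64-dim embedding) and the post-norm residual structure match the printout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlockSketch(nn.Module):
    # Rough sketch of one printed TransformerBlock: 8 heads of size 8
    # on a 64-dim embedding, post-norm residuals, 4x feed-forward.
    def __init__(self, emb=64, heads=8, ff_mult=4, p=0.2):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(emb, 3 * emb, bias=False)
        self.unify = nn.Linear(emb, emb)
        self.norm1 = nn.LayerNorm(emb)
        self.norm2 = nn.LayerNorm(emb)
        self.ff = nn.Sequential(
            nn.Linear(emb, ff_mult * emb), nn.ReLU(),
            nn.Linear(ff_mult * emb, emb),
        )
        self.drop = nn.Dropout(p)

    def forward(self, x):                    # x: (batch, seq, 64)
        b, t, e = x.size()
        h, s = self.heads, e // self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, h, s).transpose(1, 2) for z in (q, k, v))
        att = F.softmax(q @ k.transpose(-2, -1) / s ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, t, e)
        x = self.drop(self.norm1(x + self.unify(out)))  # residual + post-norm
        return self.drop(self.norm2(x + self.ff(x)))    # residual + post-norm

blk = TransformerBlockSketch()
y = blk(torch.randn(2, 16, 64))
```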

With a data set of 10k trajectories (7.5k for training, 2.5k for testing), the transformer outperformed my convolutional LSTM by 9%. However, when I tried a data set of 150k (110k train, 40k test), the training loss went all over the place and the model performed only slightly better than random guessing.

This is what my model does when given 110k training samples. The losses jump around and never go below 3.

As you can see below, it behaves as expected with 7.5k training samples:
[loss plot for the 7.5k run]
The losses don't jump around as much and end around 0.7.

I kept everything the same between the runs (batch size, optimizer, criterion, epochs, patience, etc.). I should add that I am using a modified version of the early stopping available here: https://github.com/Bjarten/early-stopping-pytorch/blob/master/MNIST_Early_Stopping_example.ipynb.
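For reference, the patience-based check that kind of early stopping performs can be sketched roughly like this. This is my own minimal version, not the exact class from the linked repo; `patience` and `delta` are assumed hyperparameters.

```python
class EarlyStopping:
    # Minimal sketch of patience-based early stopping: stop training
    # once validation loss has failed to improve for `patience` epochs.
    def __init__(self, patience=5, delta=0.0):
        self.patience, self.delta = patience, delta
        self.best = float("inf")
        self.counter = 0
        self.stop = False

    def step(self, val_loss):
        if val_loss < self.best - self.delta:
            self.best = val_loss   # improvement: reset the counter
            self.counter = 0
        else:
            self.counter += 1      # no improvement this epoch
            if self.counter >= self.patience:
                self.stop = True
        return self.stop

es = EarlyStopping(patience=2)
flags = [es.step(v) for v in (1.0, 0.9, 0.95, 0.96)]
```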

I can add more detail if necessary. However, I think my problem is conceptual, as I do not understand why only my transformer model acts like this. When I give my convolutional LSTM model more data, it works fine and its predictions improve.

Thank you very much for your help. I appreciate any suggestions to help me fix the problem.

edited: formatting and typos

Answering my own question in case anyone else runs into this in the future.

I introduced gradient clipping as shown here:
https://stackoverflow.com/questions/54716377/how-to-do-gradient-clipping-in-pytorch

clip_value = 1.0  # hyperparameter: maximum absolute gradient value
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

This appears to have resolved the problem.
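For anyone preferring not to register per-parameter hooks, PyTorch's built-in `torch.nn.utils.clip_grad_norm_` is an alternative that rescales the global gradient norm once per step inside the training loop. A minimal sketch with a stand-in model (the `max_norm` value is an assumption you would tune):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)

opt.zero_grad()
loss.backward()
# Rescale all gradients so their global L2 norm is at most max_norm,
# then take the optimizer step on the clipped gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Unlike element-wise clamping, this preserves the direction of the overall gradient vector while capping its magnitude.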