Hi
I am trying to design a transformer model using nn.TransformerEncoderLayer() and nn.TransformerEncoder(), trained in a self-supervised fashion to learn representations from the 3D coordinates of skeleton data. The model is shown below:
```
Model(
  (joint_embedding): embed(
    (cnn): Sequential(
      (0): norm_data(
        (bn): BatchNorm1d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): cnn1x1(
        (cnn): Conv2d(3, 8, kernel_size=(1, 1), stride=(1, 1))
      )
      (2): ReLU()
    )
  )
  (dif_embedding): embed(
    (cnn): Sequential(
      (0): norm_data(
        (bn): BatchNorm1d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): cnn1x1(
        (cnn): Conv2d(3, 8, kernel_size=(1, 1), stride=(1, 1))
      )
      (2): ReLU()
    )
  )
  (attention): Attention_Layer(
    (att): ST_Joint_Att(
      (fcn): Sequential(
        (0): Conv2d(8, 2, kernel_size=(1, 1), stride=(1, 1))
        (1): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): Hardswish()
      )
      (conv_t): Conv2d(2, 8, kernel_size=(1, 1), stride=(1, 1))
      (conv_v): Conv2d(2, 8, kernel_size=(1, 1), stride=(1, 1))
    )
    (bn): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (act): Swish()
  )
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (encoder_layer): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=80, out_features=80, bias=True)
    )
    (linear1): Linear(in_features=80, out_features=2048, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=2048, out_features=80, bias=True)
    (norm1): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
  )
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=80, out_features=80, bias=True)
        )
        (linear1): Linear(in_features=80, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=80, bias=True)
        (norm1): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((80,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (mlp_head): Sequential(
    (0): LayerNorm((24000,), eps=1e-05, elementwise_affine=True)
    (1): Dropout(p=0.5, inplace=False)
    (2): Linear(in_features=24000, out_features=2048, bias=True)
    (3): ReLU(inplace=True)
    (4): Dropout(p=0.5, inplace=False)
    (5): Linear(in_features=2048, out_features=512, bias=True)
  )
)
```
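For reference, the encoder part is constructed roughly like this (a minimal sketch: d_model=80, dim_feedforward=2048, dropout=0.1 and num_layers=1 match the printout above, while nhead=8 and batch_first=True are assumptions rather than my exact code):

```python
import torch
import torch.nn as nn

d_model = 80  # 8 embedding channels x 10 joints

# dim_feedforward, dropout and num_layers are taken from the printout;
# nhead=8 and batch_first=True are assumptions (nhead must divide d_model).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,  # so (N, 300, 80) is read as (batch, seq, feature)
)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)

x = torch.randn(4, 300, d_model)  # dummy (N, T, D) input
out = transformer(x)              # -> torch.Size([4, 300, 80])
```

With the default batch_first=False, a tensor shaped (N, 300, 80) would instead be interpreted as (sequence, batch, feature), which is one of the things I want to rule out.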
I used the SimCLR loss, with the augmented view of each sample as the positive and all other samples in the minibatch as negatives.
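Roughly, the loss computation looks like the following minimal NT-Xent sketch (the temperature value and the exact masking details are assumptions, not necessarily my exact code):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """Minimal SimCLR / NT-Xent sketch: z1[i] and z2[i] are the two augmented
    views of sample i; every other sample in the batch acts as a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # cosine similarity matrix
    sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs
    # the positive for row i is row i + n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# usage: z1, z2 = mlp_head outputs for the two augmented views, each (N, 512)
loss = nt_xent_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Below is the loss curve.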
I used a learning rate of 0.001 with a 10% reduction every 10th epoch, after a warmup of 100 epochs.
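The schedule is implemented roughly like the sketch below (LambdaLR with a linear warmup; the optimizer choice and the warmup shape are assumptions, only the numbers come from my setup):

```python
import torch
import torch.nn as nn

# Stand-in parameters; in my code this is the full model. Adam and the linear
# warmup shape are assumptions; lr=0.001, the 100-epoch warmup and the
# 10%-every-10-epochs decay are the values from my setup.
params = nn.Linear(80, 80).parameters()
optimizer = torch.optim.Adam(params, lr=1e-3)

warmup_epochs = 100

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs         # linear warmup to the base lr
    return 0.9 ** ((epoch - warmup_epochs) // 10)  # 10% reduction every 10 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# per epoch: train(...), then scheduler.step()
```

Below is the learning-rate graph.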
I don’t understand why the loss is increasing. The initial input shape is Nx3x300x10, and the input to the transformer encoder layer is Nx300x80.
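For the shapes, this is a rough walkthrough (assuming the 3 channels are xyz coordinates, 300 is frames, 10 is joints, and that the 8 embedding channels and 10 joints are flattened into the 80 features per frame; the permute order here is an assumption):

```python
import torch

N, C, T, V = 4, 3, 300, 10   # batch, xyz channels, frames, joints (assumed meaning)
x = torch.randn(N, C, T, V)  # raw input, N x 3 x 300 x 10

# after the embedding Conv2d(3, 8, kernel_size=1) the tensor is (N, 8, 300, 10)
emb = torch.randn(N, 8, T, V)

# flatten channels and joints into one 80-dim token per frame:
# (N, 8, 300, 10) -> (N, 300, 8, 10) -> (N, 300, 80)
tokens = emb.permute(0, 2, 1, 3).reshape(N, T, 8 * V)
print(tokens.shape)                 # torch.Size([4, 300, 80])

# flattening all 300 tokens gives the 24000 features seen by mlp_head's LayerNorm
print(tokens.reshape(N, -1).shape)  # torch.Size([4, 24000])
```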
My concerns are:
First, my model implementation may be incorrect.
Second, the loss function may be faulty.
I would really appreciate it if you could suggest how I should approach debugging this.