LayerNorm questions with Transformers

My current model has two transformers (a and b), and we compute an output from each of them. For b we run a LayerNorm operation, then we concatenate the two outputs to create ab. This is a late-fusion concatenation model.

From ab we just run a Dropout and then a Linear layer to classify.
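For reference, here is a minimal sketch of what that head looks like (a rough reconstruction only: `dim_a`, `dim_b`, `num_classes`, and the class name `LateFusionHead` are placeholders, not my actual code):

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Current setup: LayerNorm on b only, then concat -> Dropout -> Linear."""
    def __init__(self, dim_a, dim_b, num_classes, p_drop=0.2):
        super().__init__()
        self.norm_b = nn.LayerNorm(dim_b)      # only b's output is normalized
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, feat_a, feat_b):
        # late fusion: concatenate the two transformer outputs along the feature dim
        ab = torch.cat([feat_a, self.norm_b(feat_b)], dim=-1)
        return self.classifier(self.dropout(ab))
```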

Now my model has started to overfit the training set and generalize poorly to the validation set.

Extra details: I have 8000 samples in my training set, with batch sizes of (1.2, 4). I already use a WeightedRandomSampler and an unweighted CrossEntropyLoss to deal with class imbalance.
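For context, the sampling/loss setup looks roughly like this (a sketch only: `train_labels`, `train_dataset`, and the batch size shown are placeholders, not my exact values):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_labels: 1-D tensor of class indices for the 8000 training samples
# train_dataset: the corresponding Dataset
class_counts = torch.bincount(train_labels)
sample_weights = 1.0 / class_counts[train_labels]   # rarer classes get sampled more often
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(train_labels),
                                replacement=True)

train_loader = DataLoader(train_dataset, batch_size=4, sampler=sampler)
criterion = torch.nn.CrossEntropyLoss()  # unweighted, since the sampler handles the imbalance
```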

Questions:

  1. Can the LayerNorm used for b also be applied to a, or do I need separate LayerNorm layers?

  2. Since I currently apply LayerNorm before the concatenation, would it be fine to instead apply a single LayerNorm to ab, i.e. the concatenation of a and b, rather than creating two separate LayerNorms? Or would concatenating the two outputs first and then normalizing them together have a weird effect, losing the distinct information each transformer contributed?

  3. Should I even use LayerNorm at all?

  4. For the classification head, is the right order (see the sketch after this list):
    Transformer outputs → Concatenation → LayerNorm → Dropout → Linear
    Currently, I am doing:
    Transformer outputs → LayerNorm (only on b) → Concatenation → Dropout → Linear

  5. Should I apply LayerNorm on the validation and test datasets?

  6. My current dropout is 0.2. Would the model generalize better if I raised it to 0.5, or to anything greater than 0.2, to reduce overfitting?

  7. Any tips to help reduce overfitting and increase generalizability would be welcome.
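Regarding question 4, this is the alternative ordering I have in mind, again as a rough sketch with placeholder names and dimensions:

```python
import torch
import torch.nn as nn

class FusionHeadNormAfterConcat(nn.Module):
    """Proposed order: concat first, then one LayerNorm over the fused vector."""
    def __init__(self, dim_a, dim_b, num_classes, p_drop=0.2):
        super().__init__()
        self.norm = nn.LayerNorm(dim_a + dim_b)   # normalizes across the fused features
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, feat_a, feat_b):
        ab = torch.cat([feat_a, feat_b], dim=-1)
        return self.classifier(self.dropout(self.norm(ab)))
```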

Thank you in advance. :slight_smile:

Any help or tips would be appreciated @ptrblck