My current model has two transformers, a and b, and the output is computed from both. The output of b goes through a LayerNorm, and then the two outputs are concatenated to create ab. This is a late-fusion concatenation model.
From ab I just run a Dropout and then a Linear layer to classify.
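Roughly, the head looks like this (the hidden sizes, class count, and names below are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Current head: LayerNorm on b's output only, then concat -> Dropout -> Linear."""
    def __init__(self, dim_a=768, dim_b=768, num_classes=3, p_drop=0.2):
        super().__init__()
        self.norm_b = nn.LayerNorm(dim_b)                # applied to b's output only
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, out_a, out_b):
        ab = torch.cat([out_a, self.norm_b(out_b)], dim=-1)  # late-fusion concatenation
        return self.classifier(self.dropout(ab))
```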
Now my model has started to overfit the train set and generalize poorly on the validation set.
Extra details: I have 8000 pieces of data in my training set, with batch sizes of (1.2, 4). I already use a weighted random sampler and an unweighted CrossEntropyLoss to deal with class imbalance.
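For context, this is roughly how the sampler and loss are wired up (the dataset, labels, and batch size below are stand-ins, not my real values):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder data standing in for my real features/labels (8000 examples, 3 classes assumed).
features = torch.randn(8000, 16)
labels = torch.randint(0, 3, (8000,))
train_dataset = TensorDataset(features, labels)

# Draw each example with probability inverse to its class frequency so batches are roughly balanced.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)  # batch size is a placeholder
criterion = nn.CrossEntropyLoss()  # left unweighted because the sampler already rebalances
```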
Questions:
- Can the LayerNorm used for b also be run on a, or do I need separate LayerNorm layers?
- Since I currently apply LayerNorm before the concatenation, would it be fine to instead apply a single LayerNorm to ab (the concatenation of a and b) rather than creating two separate LayerNorms? Or would concatenating the two outputs first and then normalizing have an odd effect, losing information about what each transformer contributed individually?
- Should I even use LayerNorm at all?
- For the classification head, should it be:
Transformer outputs → Concatenation → LayerNorm → Dropout → Linear
Currently, I am doing:
Transformer outputs → LayerNorm (only on b) → Concatenation → Dropout → Linear
(see the sketch after the questions for the alternative ordering)
- Should I apply LayerNorm on the validation and test datasets?
- My current dropout is 0.2; would the model generalize better if I raised it to 0.5 or something greater than 0.2 to reduce overfitting?
- Any tips to help reduce overfitting and increase generalizability would be welcome.
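For reference, the alternative head ordering from the classification-head question above would look roughly like this (again with placeholder sizes):

```python
import torch
import torch.nn as nn

class FusedNormHead(nn.Module):
    """Alternative ordering I'm asking about: concat first, then one LayerNorm over ab."""
    def __init__(self, dim_a=768, dim_b=768, num_classes=3, p_drop=0.2):
        super().__init__()
        self.norm_ab = nn.LayerNorm(dim_a + dim_b)   # single LayerNorm over the concatenated vector
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, out_a, out_b):
        ab = torch.cat([out_a, out_b], dim=-1)       # concatenate the raw transformer outputs
        return self.classifier(self.dropout(self.norm_ab(ab)))
```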
Thank you in advance.