I am stacking several residual blocks and then following them with a head. I tried two different head architectures and found that the two models behave very differently depending on what sits before the layer normalization.
input -> residual block -> ... -> residual block -> head

Each residual block:
  x -> conv1d -> LayerNorm -> conv1d -> GELU -> (+) -> out
  |_____________________________________________|
                 skip connection
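In case it helps, here is a minimal PyTorch sketch of what I mean by one residual block (kernel size and padding are just illustrative choices; the names are my own):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """conv1d -> LayerNorm -> conv1d -> GELU, plus a skip connection."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2  # "same" padding so the skip shapes match
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, channels, seq_len)
        y = self.conv1(x)
        # LayerNorm normalizes over the last dim, so move channels last
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        y = self.act(self.conv2(y))
        return x + y  # skip connection
```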
* The first head: max (along the sequence dimension) -> LayerNorm -> Linear(in_channels, num_classes)
* The second head: max (along the sequence dimension) -> Linear(in_channels, in_channels) -> LayerNorm -> Linear(in_channels, num_classes)
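The two heads above, sketched in PyTorch (class names are mine, for illustration):

```python
import torch
import torch.nn as nn

class HeadA(nn.Module):
    """max over sequence -> LayerNorm -> Linear."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_channels)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):  # x: (batch, channels, seq_len)
        pooled = x.max(dim=-1).values  # (batch, channels)
        return self.fc(self.norm(pooled))

class HeadB(nn.Module):
    """max over sequence -> Linear -> LayerNorm -> Linear."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(in_channels, in_channels)
        self.norm = nn.LayerNorm(in_channels)
        self.fc2 = nn.Linear(in_channels, num_classes)

    def forward(self, x):  # x: (batch, channels, seq_len)
        pooled = x.max(dim=-1).values  # (batch, channels)
        return self.fc2(self.norm(self.fc1(pooled)))
```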
The result: the first model reaches 100% accuracy on the training set and 84% on the test set, while the second model stays at 0.1% accuracy on both the training and test sets.
I cannot understand why inserting a linear layer between the max operation and the LayerNorm has such a damaging effect on the model's performance.