Why does a LayerNorm change the whole model?

I am using several residual blocks followed by a head part in my model. I tried two different architectures and found that the models behave completely differently depending on where the layer normalization sits.

This is my model structure:

input → residual blocks → residual blocks → head part

Each residual block comprises the following operations:

  -> conv1d -> layernorm -> conv1d -> GELU ->  +
                         skip  connection
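For reference, here is a minimal PyTorch sketch of one such block as I described it above (the channel count and kernel size are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch: conv1d -> layernorm -> conv1d -> GELU, with a skip connection.

    channels=64 and kernel_size=3 are placeholder hyperparameters.
    """
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(channels)  # normalizes over the channel dim
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, channels, length)
        y = self.conv1(x)
        # nn.LayerNorm expects the normalized dim last, so swap C and L around it
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        y = self.act(self.conv2(y))
        return x + y  # skip connection
```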

The following are the two different head parts I compared:

* The first is:
     max (along the sequence dimension) → LayerNorm → Linear(in_channels, num_classes)
* The second is:
     max (along the sequence dimension) → Linear(in_channels, in_channels) → LayerNorm → Linear(in_channels, num_classes)
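The two heads can be sketched like this (again a minimal reproduction, with placeholder sizes `in_channels=64` and `num_classes=10`):

```python
import torch
import torch.nn as nn

in_channels, num_classes = 64, 10  # placeholder sizes

# Head 1: max over sequence -> LayerNorm -> Linear
head1 = nn.Sequential(
    nn.LayerNorm(in_channels),
    nn.Linear(in_channels, num_classes),
)

# Head 2: max over sequence -> Linear -> LayerNorm -> Linear
head2 = nn.Sequential(
    nn.Linear(in_channels, in_channels),
    nn.LayerNorm(in_channels),
    nn.Linear(in_channels, num_classes),
)

x = torch.randn(8, in_channels, 100)      # (batch, channels, length)
pooled = x.max(dim=2).values              # max along the sequence dimension
out1, out2 = head1(pooled), head2(pooled) # both: (batch, num_classes)
```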

The result is: the first model achieves 100% accuracy on the training set and 84% on the test set. The second model gets only 0.1% on both the training set and the test set.

I cannot understand why inserting a linear layer between the max operation and the LayerNorm has such a negative effect on the performance of the model.