I have built a DNN with only one hidden layer, the following are the parameters:
input_size = 100
hidden_size = 20
output_size = 2
def __init__(self):
    super().__init__()
    self.linear1 = nn.Linear(input_size, hidden_size)
    self.linear2 = nn.Linear(hidden_size, output_size)

def forward(self, x):
    x1 = F.leaky_relu(self.linear1(x))
    return F.leaky_relu(self.linear2(x1))  # unimportant code omitted
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.02)

Normalized word vectors of size 100 from an authoritative GitHub repository are used as input.
My purpose is to identify whether a word is an event. For example, 'drought' is an event but 'dog' is not.
After training, the 2-dimensional output tensors are almost the same (say, (-0.8, -1.20) and (-0.8, -1.21), or (-0.2, -1.01) and (-0.2, -1.02)), even if the activation function and loss function are changed.
Could someone tell me the reason? I tried my best but failed to solve it.

Could you check the weight and bias in both layers?
Sometimes, e.g. when the learning rate is too high, the model just learns the 'mean prediction', i.e. the bias is responsible for most of the prediction, while the weights and the input become more or less useless.
For example, when I was playing with a facial keypoint dataset, some models just predicted the 'mean position' of the keypoints, regardless of the input image.
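One way to run that check — a minimal sketch, assuming a two-layer model like the one in the question (the `nn.Sequential` stand-in and the layer sizes here are placeholders for the poster's actual model):

```python
import torch
import torch.nn as nn

# Placeholder two-layer model matching the shapes in the question
model = nn.Sequential(nn.Linear(100, 20), nn.Linear(20, 2))

# Compare the scale of the weights against the bias in each linear layer
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        w = module.weight.abs().mean().item()
        b = module.bias.abs().mean().item()
        print(f"{name}: mean |weight| = {w:.4f}, mean |bias| = {b:.4f}")
```

If the bias magnitudes dwarf the weight magnitudes after training, that is a hint the model has collapsed to a constant prediction.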

Could you please kindly elaborate on the bias and 'mean prediction' part? I've seen this explanation multiple times on the Internet but cannot get it. When the learning rate is too high, my understanding is that the model wouldn't converge. Why would that result in the bias being responsible for most of the prediction? Thanks!

I'm not sure if there is an underlying mathematical explanation for this effect.
In the past I experienced that basically the bias in the last layer took the mean values of the regression targets, so regardless of the input, I always got the average of my targets.
It could be an edge case, and I don't have a proper explanation for it.
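A toy illustration of that failure mode (my own sketch, not code from this thread): if the weights collapse toward zero, the bias alone determines the output, so every input maps to the same constant — e.g. the target mean.

```python
import torch
import torch.nn as nn

# Hypothetical regression layer whose weights have collapsed to zero
layer = nn.Linear(10, 1)
with torch.no_grad():
    layer.weight.zero_()   # weights contribute nothing
    layer.bias.fill_(3.5)  # bias holds the (assumed) target mean

x1 = torch.randn(10)
x2 = torch.randn(10)
print(layer(x1).item(), layer(x2).item())  # both 3.5, regardless of input
```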

That's probably what happened to my model too. I did not check the bias values that carefully, but I did notice the bias becoming dominant in scale relative to the weights. The outputs are also indeed the mean.

I solved this by 1) normalizing the input (subtracting the mean and dividing by the standard deviation) and 2) using a smaller learning rate.
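For reference, step 1 can be sketched like this, assuming the word vectors sit in a tensor `X` of shape [num_samples, 100] (the tensor here is random placeholder data):

```python
import torch

# Placeholder batch of 100-dim word vectors with nonzero mean and large scale
X = torch.randn(500, 100) * 4.0 + 2.0

# 1) normalize: subtract the per-feature mean, divide by the per-feature std
mean = X.mean(dim=0)
std = X.std(dim=0)
X_norm = (X - mean) / std

# 2) use a smaller learning rate, e.g.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.002)
```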

This problem may be due to batch normalization. When you are evaluating your model, you should disable batch normalization: call model.eval() when you want to evaluate the model (so batch normalization uses its running statistics) and call model.train() again when you want to train the model.

Does this mean that if we do validation during training with model.eval(), then in the main training loop, before we call optimizer.step(), we should add model.train()?

You should call model.train() before the forward pass in your training loop.
If you call it before optimizer.step(), the forward pass will have been already executed in eval mode.

If you set bias=False during the initialization of the layer, the internal .bias parameter will be set to None and will thus not be available, which would be different from setting the value of the bias to zero.
The latter case can be achieved by manipulating this parameter e.g. via:

with torch.no_grad():
    model.linear_layer.bias.fill_(0.)
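To see the difference between the two cases side by side, a quick sketch (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Case 1: bias disabled entirely -> the .bias attribute is None
no_bias = nn.Linear(10, 2, bias=False)
print(no_bias.bias)  # None

# Case 2: bias parameter exists but its value is set to zero
zero_bias = nn.Linear(10, 2)
with torch.no_grad():
    zero_bias.bias.fill_(0.)
print(zero_bias.bias)  # a trainable parameter of zeros
```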