Softmax is bugged?

I'm using the Adam optimizer with MSELoss as the loss function, and I want to train a 5-class classifier.

My model outputs the following tensor after the first training sample:

[-0.1180, -0.0932, -0.9693, 0.1546, -0.5936]

which becomes the following tensor after softmax is applied:
[0.2279, 0.2337, 0.0973, 0.2994, 0.1417]

This looks perfectly fine.
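For reference, the second tensor is just torch.softmax applied to the raw outputs; a quick standalone check, not my actual training code:

```python
import torch

raw = torch.tensor([-0.1180, -0.0932, -0.9693, 0.1546, -0.5936])
print(torch.softmax(raw, dim=0))
# -> approximately [0.2279, 0.2337, 0.0973, 0.2994, 0.1417], as above
```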

However, after the optimizer step, the outputs for the next thousands of samples all look like:
[-24.2190, 25.5244, -24.5743, -24.6079, -24.7835]
=> [2.4931e-22, 1.0000e+00, 1.7475e-22, 1.6897e-22, 1.4176e-22], which is basically [0,1,0,0,0]

[-39.9633, 40.9215, -40.6142, -40.0271, -39.2909]
=> [7.4504e-36, 1.0000e+00, 3.8857e-36, 6.9895e-36, 1.4595e-35], which is basically [0,1,0,0,0]

[-52.3863, 53.6466, -52.7560, -53.2293, -52.4467]
=> [8.9228e-47, 1.0000e+00, 6.1649e-47, 3.8403e-47, 8.3996e-47], which is basically [0,1,0,0,0]

… etc.

Thus my model never learns. What am I doing wrong? The first iteration looks fine, but then the network starts outputting huge logits, which don't play well with softmax. Do I need to normalize the data before softmax? Do I need to use log_softmax? I can't seem to crack this problem, so any help is much appreciated.

For reference, my loss over the first iterations is:

tensor(0.2033, dtype=torch.float64, grad_fn=)
tensor(0.2752, dtype=torch.float64, grad_fn=)
tensor(0.3007, dtype=torch.float64, grad_fn=)
tensor(0.4000, dtype=torch.float64, grad_fn=)
tensor(0.4000, dtype=torch.float64, grad_fn=)
tensor(0.4000, dtype=torch.float64, grad_fn=)
tensor(0.4000, dtype=torch.float64, grad_fn=)
tensor(0.4000, dtype=torch.float64, grad_fn=)

That might indicate that your learning rate is too high or that your loss function is a poor fit.
Once the softmax saturates, backprop gets essentially no gradient through it.
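To illustrate (a minimal standalone sketch, not your code, using a hypothetical one-hot target in a different class): once the logits are that far apart, the softmax is saturated, MSELoss is stuck at a constant 0.4, and the gradient reaching the logits is essentially zero, so Adam has nothing to work with.

```python
import torch
import torch.nn.functional as F

# Saturated logits copied from the post above
logits = torch.tensor([-24.2190, 25.5244, -24.5743, -24.6079, -24.7835],
                      requires_grad=True)
probs = torch.softmax(logits, dim=0)              # ~[0, 1, 0, 0, 0]
target = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical one-hot target

loss = F.mse_loss(probs, target)                  # ~0.4, the constant loss above
loss.backward()
print(loss)
print(logits.grad)                                # ~0 everywhere: no learning signal
```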

Best regards

Thomas


I can try lowering the learning rate; I currently have 0.01 with the Adam optimizer. I might simply drop MSELoss, but I was doing a hyper-parameter grid search over the loss functions [CrossEntropyLoss, MSELoss, MultiMarginLoss] and optimizers [Adam, Adadelta, RMSProp].

I thought it was possible to use MSELoss for classification tasks if the output of the model was a softmax like [0.1, 0.5, 0.1, 0.2, 0.1] and the y-tensor was one-hot like [0,1,0,0,0].

But I might be wrong? And since the model outputs results more like [0,0,0,0,1], the loss is constant and nothing is learned.
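For what it's worth, here is a minimal sketch of how the CrossEntropyLoss option from my grid would be wired up (placeholder model and data, not my actual setup): the model outputs raw logits with no softmax layer, and the targets are class indices instead of one-hot vectors, since CrossEntropyLoss applies log_softmax internally.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 5)                      # stand-in for the real 5-class model
criterion = nn.CrossEntropyLoss()             # applies log_softmax + NLL internally
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

x = torch.randn(8, 16)                        # dummy batch of 8 samples
y = torch.randint(0, 5, (8,))                 # class indices 0..4, not one-hot

logits = model(x)                             # raw scores, no softmax here
loss = criterion(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```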

EDIT:

Okay, I lowered the learning rate by a factor of 10; that simply made it take ~10 iterations before the loss stayed constant.
However, I then lowered it by a factor of 100 and now the loss looks to be changing properly!
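In code, the only change is the lr argument (variable names are placeholders):

```python
# before: optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)  # lowered by a factor of 100
```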

Loss now:
tensor(0.1871, dtype=torch.float64, grad_fn=)
tensor(0.2140, dtype=torch.float64, grad_fn=)
tensor(0.1251, dtype=torch.float64, grad_fn=)
tensor(0.2270, dtype=torch.float64, grad_fn=)
tensor(0.1767, dtype=torch.float64, grad_fn=)
tensor(0.1708, dtype=torch.float64, grad_fn=)
tensor(0.0515, dtype=torch.float64, grad_fn=)
tensor(0.1856, dtype=torch.float64, grad_fn=)
tensor(0.0466, dtype=torch.float64, grad_fn=)
tensor(0.2470, dtype=torch.float64, grad_fn=)
tensor(0.1243, dtype=torch.float64, grad_fn=)

etc.

I will try using this and simply add a couple of hundred more epochs and hope I don't end up in an early local minimum :)