Error in the information entropy implementation!

Hello, I’m currently fine-tuning an ImageNet-pretrained RN50 model. My goal is to improve its performance on the Caltech dataset by minimizing the information entropy of its predictions.

import torch
import torch.nn.functional as F

softmax_out = F.softmax(logits, dim=1)                  # predicted class probabilities
entropy = -softmax_out * torch.log(softmax_out + 1e-5)  # per-class entropy terms
loss = torch.mean(torch.sum(entropy, dim=1))            # mean per-sample entropy

However, I still see the behavior below even after training for all 50 epochs. I attempted to adjust the learning rate, but no improvement was observed.

epoch [1/50] batch [1/2] loss 1.8028 acc 21.8750
epoch [1/200] batch [2/2] loss 1.0584 acc 6.2500
epoch [2/200] batch [1/2] loss 1.8191 acc 3.1250
epoch [2/200] batch [2/2] loss 0.8207 acc 0.0000
epoch [3/200] batch [1/2] loss 0.0007 acc 3.1250
epoch [3/200] batch [2/2] loss 0.1916 acc 3.1250
epoch [4/200] batch [1/2] loss -0.0000 acc 3.1250
epoch [4/200] batch [2/2] loss -0.0000 acc 3.1250
epoch [5/200] batch [1/2] loss -0.0000 acc 3.1250
epoch [5/200] batch [2/2] loss 0.2917 acc 3.1250
epoch [6/200] batch [1/2] loss 0.3784 acc 3.1250

Hi External!

I can’t say that I’m surprised by the result. Given that you’ve only shown us
about six epochs, I probably wouldn’t be surprised by almost any result. Even
fifty epochs wouldn’t really be a lot.

You haven’t said much about your use case or how you’re training.

Try training with plain-vanilla SGD (with neither momentum nor weight decay)
with a very small learning rate and train, potentially, for a long time. Unless
your model is weird somehow or you’re freezing the final layers of the model
while you fine-tune, you should be able to train down to zero “information
entropy.”
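
Here is a minimal sketch of what I have in mind, assuming a torchvision
ResNet-50 and your existing DataLoader (num_classes, num_epochs, and loader
are placeholders for your own setup, and lr = 1e-4 is just a starting point
to try, not a recommendation):

import torch
import torch.nn.functional as F
import torchvision

# Hypothetical setup: ImageNet-pretrained ResNet-50 with its classifier head
# replaced to match your number of Caltech classes.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Plain-vanilla SGD: no momentum, no weight decay, very small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

model.train()
for epoch in range(num_epochs):
    for images, _ in loader:  # labels aren't needed for entropy minimization
        logits = model(images)
        log_p = F.log_softmax(logits, dim=1)
        p = log_p.exp()
        loss = -(p * log_p).sum(dim=1).mean()  # mean per-sample entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

If the entropy loss then creeps down steadily instead of jumping around the
way your log does, you know the optimization itself is fine and you can start
turning the learning rate back up.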

When your probabilities, softmax_out, saturate with one of them (along a
given batch-element row) being very close to one and all the others being
very close to zero, your loss will attain its minimum of zero. However, let’s
say that the length of a row (softmax_out.size (1)) is n. Then there will
be n different ways you can minimize the loss (for that row), corresponding
to which of the n probabilities approaches one.
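
You can check this with a toy row of logits (nothing to do with your actual
model, just an illustration):

import torch
import torch.nn.functional as F

# One dominant logit: softmax saturates toward a one-hot row and the entropy
# is essentially zero (with your +1e-5 it even comes out very slightly
# negative, which is presumably what the -0.0000 lines in your log are).
logits = torch.tensor([[20.0, 0.0, 0.0, 0.0]])
p = F.softmax(logits, dim=1)
print(-(p * torch.log(p + 1e-5)).sum(dim=1))  # ~ -1e-5

# Putting the large logit in any other position gives the same near-zero
# entropy -- n equivalent minima for a row of length n.
logits = torch.tensor([[0.0, 0.0, 20.0, 0.0]])
p = F.softmax(logits, dim=1)
print(-(p * torch.log(p + 1e-5)).sum(dim=1))  # also ~ -1e-5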

Fine-tuning to the point that you converge to a specific one of these equal-loss
minima is probably not what you want. So either your proposed scheme doesn’t
fit your use case or you only want to fine-tune a little bit, mostly preserving the
weights and predictions of the pre-trained model.

As an aside, consider, for reasons of numerical stability, using log_softmax():

entropy = -torch.softmax(logits, dim=1) * torch.log_softmax(logits, dim=1)

(Note, I don’t think this is relevant to your current issue.)
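
If you want to see what the stability issue looks like, compare the two
versions on some exaggerated logits (a toy example, not your training code):

import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, 0.0]])

# Naive version without the +1e-5 fudge: softmax underflows to exactly zero
# for the small entries, log() of them is -inf, and 0 * -inf gives nan.
p = F.softmax(logits, dim=1)
print(-(p * torch.log(p)).sum(dim=1))  # nan

# log_softmax computes the log-probabilities stably, so no nan appears.
print(-(F.softmax(logits, dim=1) * F.log_softmax(logits, dim=1)).sum(dim=1))  # ~0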

Best.

K. Frank

Hi KFrank,
Thanks for the response. After a few days of struggling with the code, I found out that my model is suffering from catastrophic forgetting. That’s why there was no improvement.