Why are 2 dense layers commonly used at the end of image captioning models?

I see that common models use 2 dense layers after merging the image and text features. I understand that the last dense layer is used for prediction with softmax. But before the softmax there is a dense layer with ReLU. I think it is used to process all the features and feed its output to the softmax layer for the final prediction. Is this true?

Yes, the penultimate layer would process the incoming features and pass them to the last classification layer.
This might be beneficial, but it depends on your use case and model.
E.g., ResNet50 uses a single linear layer without a preceding one.
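For illustration, here is a minimal sketch of the two kinds of classification heads, assuming made-up feature, hidden, and vocabulary sizes:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for the merged image+text feature vector and the vocabulary.
feature_dim, hidden_dim, vocab_size = 512, 256, 10000

# Head with an extra dense layer + ReLU before the classifier,
# as in the captioning models described above.
two_layer_head = nn.Sequential(
    nn.Linear(feature_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, vocab_size),  # outputs raw logits
)

# Single-layer head, similar to torchvision's resnet50, which uses
# one nn.Linear as its final fc layer.
single_layer_head = nn.Linear(feature_dim, vocab_size)

merged_features = torch.randn(4, feature_dim)  # batch of 4 merged feature vectors
logits = two_layer_head(merged_features)       # shape: [4, vocab_size]
```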


Thanks for your reply. Yes, my accuracy increased when I added another dense layer before the softmax, but I don't actually know the reason… Normally I feed the features directly to the softmax for prediction; now the features pass through a dense layer with ReLU and then the softmax predicts.

An additional layer adds more capacity to the model, which might help.

Are you using F.softmax or an nn.Softmax module as the last non-linearity?
If so, this might be wrong for a multi-class classification use case, since e.g. nn.CrossEntropyLoss expects raw logits while nn.NLLLoss expects log probabilities created with F.log_softmax.
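A minimal sketch of the two equivalent setups, assuming hypothetical batch and vocabulary sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, vocab_size = 4, 10000
logits = torch.randn(batch_size, vocab_size)           # raw model outputs
targets = torch.randint(0, vocab_size, (batch_size,))  # target word indices

# Option 1: nn.CrossEntropyLoss applies log_softmax internally,
# so it must be fed the raw logits.
ce_loss = nn.CrossEntropyLoss()(logits, targets)

# Option 2: nn.NLLLoss expects log probabilities, so apply
# F.log_softmax first. Both options compute the same loss.
nll_loss = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce_loss, nll_loss))  # True
```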


I don’t use softmax; I take the maximum logit with argmax to get the next predicted word as a numerical value.
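That works for picking the predicted word, since softmax is monotonic and does not change which index is largest; a minimal sketch with made-up sizes:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10000)  # hypothetical per-word logits for a batch of 4

# Softmax is monotonic, so taking argmax over the raw logits
# selects the same word index as taking argmax over probabilities.
pred_from_logits = logits.argmax(dim=1)
pred_from_probs = F.softmax(logits, dim=1).argmax(dim=1)

print(torch.equal(pred_from_logits, pred_from_probs))  # True
```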