To get feature maps from a model, the typical approach is to edit the forward function in the model definition
so that it returns the intermediate feature maps in addition to the final output; see, e.g., How to extract features of an image from a trained model - #6 by fmassa.
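As a rough illustration of that pattern, here is a minimal sketch. `SmallNet`, its layer sizes, and the 32x32 input are all made up for the example and are not taken from the linked thread; the only point is that `forward` hands back the intermediate activations alongside the logits.

```python
import torch
import torch.nn as nn


class SmallNet(nn.Module):
    """Hypothetical toy classifier, used only to illustrate the pattern."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        # 16 channels * 8 * 8 spatial positions, assuming a 32x32 input
        self.fc = nn.Linear(16 * 8 * 8, num_classes)

    def forward(self, x):
        # Keep a handle on each intermediate activation instead of overwriting x
        f1 = self.pool(torch.relu(self.conv1(x)))
        f2 = self.pool(torch.relu(self.conv2(f1)))
        logits = self.fc(torch.flatten(f2, 1))
        # Return the feature maps in addition to the final output
        return logits, [f1, f2]


model = SmallNet().eval()
with torch.no_grad():
    logits, feats = model(torch.randn(2, 3, 32, 32))

print(logits.shape)
for f in feats:
    print(f.shape)
```

Any code that calls the model then has to unpack the tuple, so existing training or evaluation loops that expect a single tensor would need a small change as well.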
Some fine-tuning would usually be needed, at the very least, since classification is a rather different domain from captioning.