Ensemble resnetk + bow features

Hello! Is it possible to take the features from the last layer (before the FC layer) of a resnetk (18, 34 or 50) trained on a set of images, combine them with a set of bag-of-words (BoW) features, and then feed the combined features to another FC layer that does the classification?

In summary, I want to do “early fusion” so that the classifier uses both the image and the text features. But since I’m new to this framework, I have not been able to figure out how to do it.

PS: To train the resnetk models on their own, I’m using the train function provided in the PyTorch transfer learning tutorial (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html).
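My guess is that the training loop would then need each batch to carry three items (image, BoW vector, label) instead of two. A minimal sketch of one epoch under that assumption is below. It is not the tutorial's code: `TinyFusion` is a small made-up stand-in for the fused model so the example runs quickly, and the random tensors and sizes are placeholders for the real data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class TinyFusion(nn.Module):
    """Hypothetical stand-in for the ResNet + BoW model, kept tiny on purpose."""

    def __init__(self, img_dim, bow_dim, num_classes):
        super().__init__()
        self.img_net = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, 16), nn.ReLU())
        self.classifier = nn.Linear(16 + bow_dim, num_classes)

    def forward(self, images, bow):
        # Concatenate image features with BoW features before the classifier.
        return self.classifier(torch.cat([self.img_net(images), bow], dim=1))


torch.manual_seed(0)
# Placeholder data: 8 small "images", 8 BoW vectors of size 20, 3 classes.
images = torch.randn(8, 3, 8, 8)
bow = torch.rand(8, 20)
labels = torch.randint(0, 3, (8,))
# Each dataset item is a triple (image, bow, label).
loader = DataLoader(TensorDataset(images, bow, labels), batch_size=4, shuffle=True)

model = TinyFusion(img_dim=3 * 8 * 8, bow_dim=20, num_classes=3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
running_loss = 0.0
for batch_images, batch_bow, batch_labels in loader:
    optimizer.zero_grad()
    outputs = model(batch_images, batch_bow)   # both inputs go through the model
    loss = criterion(outputs, batch_labels)
    loss.backward()
    optimizer.step()
    running_loss += loss.item() * batch_images.size(0)

epoch_loss = running_loss / len(loader.dataset)
print(f"epoch loss: {epoch_loss:.4f}")
```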