Combine predictions of torchvision and torchtext models

I am working on a multi-class image classification problem in which each image has an associated text description. I first trained an image-only model (ResNet) that classifies each image from the image alone, with no text. I also trained a torchtext model (TextCNN) that classifies each image from its text annotation alone.

Is there some way I can combine the outputs of the torchvision and torchtext models? Right now I’m just taking the top 3 predictions from each model and selecting the class with the highest probability, but I think there is probably a better way to do this.
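
For reference, one common alternative to picking the single highest probability is late fusion: convert each model's logits to probabilities and take a weighted average per class. The sketch below is only an illustration in plain Python, not my actual code. The names `softmax`, `fuse`, and `image_weight` are hypothetical, and it assumes both models output logits over the same classes in the same order. With tensors the same arithmetic would apply to the softmax outputs of each model.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_logits, text_logits, image_weight=0.5):
    """Weighted average of the two models' class probabilities.

    Assumes both lists cover the same classes in the same order.
    image_weight is a hypothetical knob; 0.5 treats both models equally.
    """
    image_probs = softmax(image_logits)
    text_probs = softmax(text_logits)
    fused = [
        image_weight * p + (1.0 - image_weight) * q
        for p, q in zip(image_probs, text_probs)
    ]
    # Index of the class with the highest fused probability.
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Made-up logits for a 3-class problem, one sample.
image_logits = [2.0, 1.0, 0.1]  # image model favours class 0
text_logits = [0.1, 3.0, 0.2]   # text model favours class 1
best, fused = fuse(image_logits, text_logits)
print(best, fused)
```

The weight could presumably be tuned on a held-out validation set, so that the more reliable model counts for more.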