Using text within an image to aid in image classification

I am using a pre-trained ResNet50 model to do image classification on common candies seen in American grocery stores. I have 103 classes and 1,400 sample images. The model works relatively well; however, its most common mistake is confusing different varieties of the same specific candy (e.g., two flavors of Sour Patch Kids, or Hershey's dark chocolate vs. milk chocolate), even though the text on the packaging is very different.

I am wondering if it's possible to explicitly include some level of OCR (optical character recognition) in the model, or to do some sort of ensembling with a torchtext model. My idea is that when I process an image, I would also run it through either Tesseract or PaddleOCR to extract the image's readable text and feed that text into a torchtext model, roughly as in the sketch below. Is there a common paradigm for image classification where text within the image is used to aid in classification? Note that the training images I'm using are taken inside stores, so readable text isn't always available to help with classification.
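For concreteness, here is a minimal sketch of the kind of fusion I have in mind: a two-branch model that concatenates ResNet50 image features with a pooled embedding of OCR'd packaging text. This is just my own idea of how it could work, not an established recipe; names like `VOCAB_SIZE`, `TEXT_DIM`, `ocr_tokens`, and the `vocab` dict (which I'd build from OCR output on the training set) are placeholders I made up for illustration.

```python
# Sketch: fuse ResNet50 image features with an embedding of OCR'd text.
import pytesseract
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models

NUM_CLASSES = 103   # candy classes
VOCAB_SIZE = 5000   # assumed size of a vocab built from OCR on the training set
TEXT_DIM = 128      # assumed text-feature size

def ocr_tokens(image_path, vocab):
    """Run Tesseract on an image and map each word to a vocab id (0 = unknown)."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return [vocab.get(word.lower(), 0) for word in text.split()]

class CandyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Image branch: pre-trained ResNet50 with its classifier head removed,
        # leaving a 2048-dim feature vector per image.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.image_branch = nn.Sequential(*list(backbone.children())[:-1])
        # Text branch: mean-pooled embedding of OCR tokens. Empty bags
        # (images where OCR finds nothing) yield an all-zero vector.
        self.text_branch = nn.EmbeddingBag(VOCAB_SIZE, TEXT_DIM, mode="mean")
        # Fusion head: concatenate both feature vectors and classify.
        self.head = nn.Linear(2048 + TEXT_DIM, NUM_CLASSES)

    def forward(self, images, token_ids, offsets):
        # token_ids: 1-D tensor of all OCR tokens in the batch; offsets marks
        # where each image's token list starts (the EmbeddingBag convention),
        # so an image with no detected text simply gets an empty bag.
        img_feats = self.image_branch(images).flatten(1)   # (B, 2048)
        txt_feats = self.text_branch(token_ids, offsets)   # (B, TEXT_DIM)
        return self.head(torch.cat([img_feats, txt_feats], dim=1))
```

One thing I like about the `EmbeddingBag` approach is that an image whose OCR returns no text just produces a zero text vector, so the classifier falls back to image features alone, which would seem to handle the in-store photos where no packaging text is visible. But I don't know if this is how people usually combine the two modalities.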