What model architecture should I use to extract all text from any image (OCR use case)?

I am working on a use case where I want to extract all visible text from any kind of image — including medicine packages, product cartons, documents, and natural scenes.
What I want to build:
A deep learning-based OCR model that takes an image as input and returns all the text written on it.
I have tried:

  1. CRNN Model (CNN + BiLSTM + CTC loss):
    I implemented a basic CRNN using Keras. The model includes:
    • CNN layers for feature extraction
    • BiLSTM layers for sequence modeling
    • Dense + CTC loss for character decoding

Any architecture suggestions, research paper links, or open-source code would be helpful