I am working on a use case where I want to extract all visible text from any kind of image — including medicine packages, product cartons, documents, and natural scenes.
What I want to build:
A deep learning-based OCR model that takes an image as input and returns all the text written on it.
I have tried:
CRNN Model (CNN + BiLSTM + CTC loss):
I implemented a basic CRNN using Keras (a minimal sketch of this kind of setup follows the list). The model includes:
CNN layers for feature extraction
BiLSTM layers for sequence modeling
Dense + CTC loss for character decoding
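For reference, here is a minimal sketch of this kind of CRNN in TF 2.x Keras. The input size (32×128 grayscale crops), the layer widths, and the vocabulary size `NUM_CLASSES` are all illustrative assumptions, not the poster's actual code:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 80  # hypothetical charset size; CTC adds one extra blank class

def build_crnn(height=32, width=128):
    inputs = keras.Input(shape=(height, width, 1), name="image")
    # CNN feature extractor: shrink the height, keep width as the time axis
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)   # -> 16 x 64
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)   # -> 8 x 32
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 1))(x)   # -> 4 x 32 (halve height only)
    # Make width the time dimension: (H, W, C) -> (W, H, C) -> (W, H*C)
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((width // 4, (height // 8) * 256))(x)
    # BiLSTM sequence modeling over the width/time axis
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    # Per-timestep softmax over the characters plus one CTC blank
    outputs = layers.Dense(NUM_CLASSES + 1, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_crnn()
# For training, one common TF 2.x approach is to wrap
# tf.keras.backend.ctc_batch_cost(labels, y_pred, input_length, label_length)
# in a custom loss or loss layer.
model.summary()
```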
Any architecture suggestions, research paper links, or open-source code would be helpful.
Hi, have you tried any VLMs (vision-language models) for OCR? There are many new architectures and strategies that might work for your case, especially since your images span many different kinds of scenes, where VLMs tend to have stronger capabilities. One place to start: VLM For OCR - a Cuiunbo Collection.
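As one accessible transformer-based starting point, here is a hedged sketch using TrOCR via Hugging Face `transformers` (a full VLM would be driven similarly through its `generate` API); `"sample.png"` is a hypothetical single-line text crop:

```python
# Sketch: transformer text recognition with TrOCR via Hugging Face transformers.
# "sample.png" is a hypothetical input; any pre-cropped text line works.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("sample.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Note that TrOCR recognizes a single text line at a time, so for full scenes you would still pair it with a text detector.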
Another option is Tesseract, a library that has been around for quite some time but still has active development. It also has a Python wrapper you can use: pytesseract on PyPI.
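Using it is only a couple of lines, assuming the Tesseract binary is installed on your system and `"sample.png"` is a hypothetical input image:

```python
# Requires the Tesseract binary plus `pip install pytesseract pillow`.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("sample.png"))
print(text)
```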
Finally, you could also find datasets and pretrained detection models for the characters you need and fine-tune such a model yourself on your data. This is definitely the most work, but it also gives you the most freedom to change the architecture and data in any way you want.
Your current CRNN model has a fundamental limitation: it’s designed to read text that flows in a single, predictable direction, like a single line of text. This works for simple cases but fails on complex scenes where text appears at different angles, sizes, and locations throughout the image.
In my personal experience, PaddleOCR is more robust to this kind of variation. The pre-trained PaddleOCR models already handle a wide range of text styles, but fine-tuning them on your specific use cases (medicine packages and product cartons) will significantly improve accuracy.
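For a quick feel of the out-of-the-box pipeline, here is a minimal sketch using the PaddleOCR 2.x-style API (newer releases changed the interface); `"carton.jpg"` is a hypothetical input file:

```python
# Requires `pip install paddlepaddle paddleocr` (2.x-style API assumed).
from paddleocr import PaddleOCR

# The angle classifier helps with rotated text on packaging.
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("carton.jpg", cls=True)

# Each detected line comes back as (bounding box, (text, confidence)).
for box, (text, confidence) in result[0]:
    print(text, confidence)
```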