Finding small text in image - best method?

stevew77 · June 30, 2019, 12:35pm

Hi

I’m new to Python and machine vision.

I need to find a fairly short string of text (2-4 digits) in a photo of a cow’s head (its ID tag).

The text may be rotated slightly by up to +/- 20 degrees.

Then I need to extract the text from the photo.

The text could also have a thick black boundary around it, or it may not.

Does anyone have a suggestion as to an approach for this please?

Some people have suggested ROI and HOG, however as I’m not experienced in this I need general guidance initially.

I’m thinking I could maybe identify the shape of a cow, then zoom in on the cow’s head so I’m in the general area where the text is, then find the text. How would I do that (in broad terms), or is this a dumb idea?

Any suggestions welcome! I’m enjoying machine vision, however there is much to learn.

ptrblck · June 30, 2019, 3:07pm

I assume you already have the extracted text for the dataset.
It might be a good idea to use some detection model (deep learning or not) to find the text box in the first step and extract the text using these extracted text boxes.
Do you have the coordinates of the text boxes?

Also, could you post an image of a cow with the corresponding text?
I’m curious where the cows have some kind of text on their heads.

KFrank · June 30, 2019, 3:30pm

Hello Steve!

The general task you are asking about is called “object
detection.” There is a lot of literature about it that discusses
a number of well-developed methods and algorithms.

Let me suggest one standard approach. It isn’t necessarily
the “best,” but it is conceptually straightforward, and
relatively easy to implement. This effective, but somewhat
old-school method (that isn’t necessarily the most efficient)
runs as follows:

First build a binary object classifier – that is, is this object
X or not? – and then apply your object classifier to windows
(of various sizes) that you slide across your image.

In more detail:

Manually annotate your training data. That is, make a
set of scaled, centered, cropped images that contain
cow ids snipped out of your larger images. These are
your “positive” – “yes, cow id” training images. Also
make a similar set of scaled, cropped images snipped
out of your larger images – a cow’s tail, part of a tractor,
a bale of hay. These are your “negative” – “no cow id”
training images.

These should be smallish images. Depending on the typical
aspect ratio of a cow id, these might be, say, 32x16 pixel
images. I doubt that you will need to orient (rotate to the
horizontal) your cow ids – 20 degrees shouldn’t be a problem
for training a classifier.
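
As a rough sketch of how one annotated crop could be turned into a
fixed-size training patch: the helper name crop_and_resize, the
(x, y, w, h) box convention, and the reading of "32x16" as 32 pixels
wide by 16 tall are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed patch size: 32 pixels wide by 16 pixels tall.
PATCH_H, PATCH_W = 16, 32

def crop_and_resize(image, box):
    """Cut box = (x, y, w, h) out of a (C, H, W) float image and rescale it."""
    x, y, w, h = box
    crop = image[:, y:y + h, x:x + w]
    # F.interpolate expects a batch dimension, i.e. (N, C, H, W).
    patch = F.interpolate(crop.unsqueeze(0), size=(PATCH_H, PATCH_W),
                          mode="bilinear", align_corners=False)
    return patch.squeeze(0)

# Random stand-in for a real photo; the box is a made-up annotation.
image = torch.rand(3, 480, 640)
patch = crop_and_resize(image, (100, 50, 120, 60))
```

The same helper would be applied to both the "positive" and the
"negative" annotations, so that every training image has the same shape.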

Now train a binary cow-id classifier. (Since this is the
pytorch forum, we imagine that you train a simple
convolutional neural network – although a fully-connected
network would likely work fine – but you could also build
a non-neural-network classifier.)
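
A minimal sketch of such a classifier and a single training step is
shown here. The layer sizes, learning rate, and the class name
CowIdClassifier are illustrative assumptions, and the random tensors
stand in for real annotated patches.

```python
import torch
from torch import nn

class CowIdClassifier(nn.Module):
    """Small CNN for 3 x 16 x 32 patches that outputs one logit per patch."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 16x32 -> 8x16
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 8x16 -> 4x8
        )
        self.head = nn.Linear(16 * 4 * 8, 1)

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(1)).squeeze(1)  # shape (N,)

model = CowIdClassifier()
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in batch: 8 random patches, label 1.0 = "cow id", 0.0 = "not".
patches = torch.rand(8, 3, 16, 32)
labels = torch.tensor([1., 0., 1., 0., 1., 0., 1., 0.])

# One optimisation step.
logits = model(patches)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice this step would be repeated over many batches of real
patches, with some annotated images held back to check accuracy.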

To make your detector, slide windows of various sizes (and
possibly aspect ratios) across your image, and rescale the
contents of your window to the 32x16 size used to train your
classifier. When a window slides over a cow id, your classifier
should return “yes”, and you have a “hit.” When the window is
over something else, your classifier should return “no.”
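
The sliding-window loop could be sketched as follows, in plain Python.
The function names, the stride of a quarter window, and the toy scoring
function are assumptions for illustration. In a real detector, score_fn
would crop the window, rescale it to the training size, and return the
classifier's logit (a logit above 0 corresponds to a probability above
0.5).

```python
def sliding_windows(img_w, img_h, sizes, stride_frac=0.25):
    """Yield (x, y, w, h) windows of each size that fit inside the image."""
    for w, h in sizes:
        if w > img_w or h > img_h:
            continue
        step_x = max(1, int(w * stride_frac))
        step_y = max(1, int(h * stride_frac))
        for y in range(0, img_h - h + 1, step_y):
            for x in range(0, img_w - w + 1, step_x):
                yield (x, y, w, h)

def detect(img_w, img_h, sizes, score_fn, threshold=0.0):
    """Return [(box, score), ...] for every window scoring above threshold."""
    hits = []
    for box in sliding_windows(img_w, img_h, sizes):
        score = score_fn(box)
        if score > threshold:
            hits.append((box, score))
    return hits

def toy_score(box):
    """Stand-in for the classifier: positive near a pretend tag at (132, 66)."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    return 1.0 if abs(cx - 132) <= 10 and abs(cy - 66) <= 10 else -1.0

hits = detect(640, 480, sizes=[(64, 32), (128, 64)], score_fn=toy_score)
```

Even with this toy scorer, several overlapping windows report a hit for
the same location, which is why a clean-up step is needed afterwards.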

Lastly, as your window slides across a cow id, you are
likely to get several hits for the same cow id. (Windows
of different sizes are also likely to give you duplicate
hits.) You want the best hit for any given cow id, so you
apply what is called “non-maximal suppression.” For
this you want your classifier to provide not just a “yes”
or “no,” but also the “probability” or “confidence”
underlying the “yes” or “no.” (Let’s say that your
network has a single output – a “logit” – that is fed
into BCEWithLogitsLoss when training your network.
You could use this logit as your confidence.) For
non-maximal suppression, you look at the confidence
of each of the hits in a cluster of neighboring hits,
and keep only the hit with the largest confidence.
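
One common way to decide which hits belong to the same cluster is to
measure box overlap (intersection over union, IoU) and process hits
greedily from highest to lowest confidence. The sketch below takes that
approach; the 0.3 overlap threshold and the example hits are made-up
values, and boxes use an assumed (x, y, w, h) convention.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(hits, iou_threshold=0.3):
    """Keep a hit only if it does not overlap a higher-confidence kept hit."""
    kept = []
    for box, score in sorted(hits, key=lambda h: h[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

# Three overlapping hits on one tag, plus one separate hit elsewhere.
hits = [
    ((100, 50, 64, 32), 2.1),
    ((108, 52, 64, 32), 3.4),
    ((96, 48, 80, 40), 1.2),
    ((400, 300, 64, 32), 0.9),
]
best = non_max_suppression(hits)
# best == [((108, 52, 64, 32), 3.4), ((400, 300, 64, 32), 0.9)]
```

The scores here play the role of the classifier logits, so the hit with
the largest logit in each cluster is the one that survives.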

The location of the sliding window, together with its
size (and possibly aspect ratio) for this largest-
confidence hit is the bounding box of your detected
cow id that you can then feed to further processing.

Your idea of first finding the cow and then zooming in on its
head isn’t a dumb idea – you would be using substantive
“semantic” structure of your images, that is, cow ids appear
on cow heads, and not on bales of hay, etc.

I would imagine that unless text appears in lots of other places
in your images – license plates on tractors, cow id labels
stuck to the sides of barns – just detecting the cow ids
directly will be simpler and easier to implement.

(Even if your images contain a lot of other text, as long as
the cow-id text is somehow distinctive – a certain font, thick
black borders – you could still train just on the cow ids,
provided you make sure to include representative non-cow-id
text in your “not” training samples.)

Good luck.

K. Frank

stevew77 · July 1, 2019, 6:33am

Hi ptrblck, cows usually have identity tags attached to their ears, each showing a 4-digit number of consistent shape and size.

Hi KFrank, thank you for your detailed “how to”. I will now go and research some more, and will no doubt have a few questions. As it’s a big field of study, I’m trying to bolt together existing solutions to make it all work, although I suspect I will trash my Ubuntu laptop and need to rebuild it a few times along the way.

Would you recommend re-imaging a laptop or maybe working within a virtual machine?

Cheers

Steve