What do I have to label for object detection with Faster R-CNN?

Hey.

My goal is to detect manufacturers in images. Most of the time they are represented by a logo + a text label with the name, and less often by only one of the two. Most of them I can detect with OCR, since the manufacturer names are usually easily readable (for OCR), but from time to time I come across text labels with curly letters.

So the question is exactly about the latter case.

If I have a logo and a text label with curly letters, should I add only the logos to the dataset, or the logos + label names?

I ask because I assume that logos + names take up a larger area, but it may be hard for the network to distinguish the words, while logos alone are more distinct from each other but small in size.

For example, I have the following item.
Which labeling method is better with respect to the stability of the network's detections?

  1. only logo

  2. logo + manufacturer name

Which case is more valid?

I think you should try out both approaches and check which one works best for your specific use case.
My guess would be that the logo + text crop could work better, as the text also provides a lot of features the CNN should be able to extract.

Thank you for your opinion.
Here are two more related questions; I would like to hear your guesses, if possible:

I.
For example, the manufacturer above (e.g. class A) can be represented by

  1. only logo;
  2. only text label;
  3. logo + text;

So we have images of this category with all three possible variations.
According to the three cases above, we crop the following areas:

  1. logo as class A;
  2. text as class A;
  3. could we crop logo + text as class A, or should we crop the logo as class A and the text as class A separately, getting two bounding boxes?

In case I'm expressing myself unclearly:
If a category has two representations (logo and text), and when only one of them is present we label just that one, then what should we do when both representations appear next to each other: keep labeling them separately, combine them into one box, or should each class have only one representation, which would mean creating two categories (for example class_A_logo and class_A_text)?

II.
If your answer to the previous question is that we should combine closely positioned boxes into one, then there's one more question.
Logo + text can be positioned in different ways (next to each other or one below the other).
Does the translation invariance of a CNN cope with that difference in positioning?

Thanks

It depends a bit on your current workflow. If you are detecting the logo and (logo + ) text separately, are you passing them to the same model or are you using a “logo” and another “text” model?
In the former case (one model only), you could try to detect overlapping bounding boxes, or ones which are close to each other, and let the same model create multiple predictions, which could then be combined into a (weighted) final prediction.
This workflow would of course be a bit more complicated than using a single input, so in case you would like to start simple (which I would recommend), you could try to use the largest box, which should include the logo + text, if possible.
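A rough sketch of the box-merging idea (untested; `merge_close_boxes`, the `gap` value, and the greedy loop are made up for illustration, not a fixed recipe):

```python
import torch
from torchvision.ops import box_iou

def merge_close_boxes(boxes, gap=10.0):
    # boxes: Tensor[N, 4] in (x1, y1, x2, y2) format.
    # Greedily merge boxes that overlap or lie within `gap` pixels
    # of each other into one enclosing box.
    boxes = boxes.clone()
    merged = True
    while merged and len(boxes) > 1:
        merged = False
        # Expand each box by `gap` so that nearby (not only overlapping)
        # boxes get a positive IoU.
        expanded = boxes + torch.tensor([-gap, -gap, gap, gap])
        iou = box_iou(expanded, expanded)
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if iou[i, j] > 0:
                    # Replace the pair with its enclosing box.
                    enclosing = torch.stack([
                        torch.min(boxes[i, 0], boxes[j, 0]),
                        torch.min(boxes[i, 1], boxes[j, 1]),
                        torch.max(boxes[i, 2], boxes[j, 2]),
                        torch.max(boxes[i, 3], boxes[j, 3]),
                    ])
                    keep = [k for k in range(len(boxes)) if k not in (i, j)]
                    boxes = torch.cat([boxes[keep], enclosing.unsqueeze(0)])
                    merged = True
                    break
            if merged:
                break
    return boxes
```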

In that simpler case, let's say we crop a box containing logo + text when both are present, but can we also crop boxes containing only the logo or only the text, or is the point to crop only their combination?

In my “simple” use case I describe a single crop, so that a single model would also get only one input and predict the class.
If you have multiple crops, you would have to think about how multiple crops of the same image should be treated:

  • single model with multiple forward passes → prediction is a weighted mean? (see the sketch after this list)
  • multiple models → would this work for the use case, or are the crops imbalanced so that one model might not be trained enough?
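To make the first point concrete, a minimal sketch (assuming a plain image classifier `model`, crops that are already resized to its input size, and placeholder uniform weights):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_from_crops(model, crops, weights=None):
    # crops: list of [3, H, W] tensors (e.g. logo crop, text crop),
    # assumed to be already resized to the model's input size.
    logits = torch.stack([model(c.unsqueeze(0)).squeeze(0) for c in crops])
    probs = F.softmax(logits, dim=1)
    if weights is None:
        weights = torch.full((len(crops),), 1.0 / len(crops))
    # Weighted mean over the crops -> one probability vector per class.
    combined = (weights.unsqueeze(1) * probs).sum(dim=0)
    return combined.argmax().item(), combined
```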

Sorry, I'm a little confused.
Do you mean a classifier model by that, or something else?

I have a Faster R-CNN model, and in the training forward pass I pass the image together with the ground truth boxes for all classes present in the image.
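I.e. something like this (a minimal sketch following torchvision's detection API; the boxes, class ids, and image size are made up):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=3)  # background + class_a + class_b
model.train()

images = [torch.rand(3, 600, 800)]
targets = [{
    # one box per annotated object in (x1, y1, x2, y2) format
    "boxes": torch.tensor([[100., 150., 220., 210.],   # e.g. the logo
                           [230., 160., 480., 205.]]), # e.g. the text label
    "labels": torch.tensor([1, 1]),  # both boxes belong to class_a
}]

loss_dict = model(images, targets)  # the training forward pass returns the losses
loss = sum(loss_dict.values())
```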

Could you please explain what exactly you are aiming at by using a single crop as input?

I might be mistaken, but in the standard use case each target bounding box would represent a different object. I.e. while the class might be the same (e.g. two persons), the objects themselves would not be the same.
In your use case, however, the logo and text would both represent the same object and class.
I just wanted to point out the case where the model could predict classA for the logo and classB for the text. What would you do in such a case, i.e. what would the prediction be?
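If you wanted to resolve such a conflict programmatically, one made-up tie-breaking rule would be to let the higher-scoring detection win (the IoU threshold and the rule itself are just one option):

```python
import torch
from torchvision.ops import box_iou

def resolve_label_conflicts(boxes, labels, scores, iou_thresh=0.3):
    # If two detections overlap but disagree on the class (e.g. classA for
    # the logo, classB for the text), assign the higher-scoring label to both.
    iou = box_iou(boxes, boxes)
    labels = labels.clone()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou[i, j] > iou_thresh and labels[i] != labels[j]:
                winner = labels[i] if scores[i] >= scores[j] else labels[j]
                labels[i] = winner
                labels[j] = winner
    return labels
```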

It’s totally fine to represent different bboxes as different objects. The problem of identifying an object by its parts (logo, text) is out of the scope of this topic)

The initial question, put visually:
For example, we've got four images representing one class.
The third and fourth images are combinations of the first two; there can be many more such combinations.

We can label them in two ways (sketched as target dicts after the list).

  1. labeling text and logo for the same class separately:

  2. labeling closely positioned text/logo boxes as a single box + labeling logo and text as one class (just class_a, as opposed to class_a_logo and class_a_text):

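In torchvision-style target dicts, the two schemes would look roughly like this (the coordinates and class ids are invented for the example):

```python
import torch

# Scheme 1: logo and text annotated separately, same class id.
targets_separate = {
    "boxes": torch.tensor([[ 50.,  40., 170., 160.],    # logo
                           [190.,  80., 430., 140.]]),  # text
    "labels": torch.tensor([1, 1]),  # both boxes are class_a
}

# Scheme 2: closely positioned logo + text merged into one enclosing box.
targets_merged = {
    "boxes": torch.tensor([[ 50.,  40., 430., 160.]]),  # logo + text together
    "labels": torch.tensor([1]),  # still class_a
}
```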

The question is: which labeling approach is preferred?
At the moment I lean towards the first case, as it is universal, while the second might be a bad idea, because in one case we would label only the logo as class_a and in another case logo + text as class_a.
I think this might “confuse” our model.

After all this, I would just like to hear your opinion on the above)