Efficient single object detector

AviZ · December 12, 2024, 6:33pm

I want to build an object detector.
The YOLO models are too large for my use case.

In my use case there is only a single type of object to detect.
It is also guaranteed to appear once at most.

Hence I thought I could come up with a very simple and efficient model.

I was wondering: for such a case, how would you design the model?
Would you choose a YOLO-style model with anchors? Which convolutions?
Which loss would take advantage of there being only a single class?

I’d be happy if people shared their intuition about this specific use case.

Aknw_Fen · December 12, 2024, 6:50pm

The smallest YOLOX is ~2 MB after exporting to ONNX (with fp16, I believe). Did you try it?

It’s not that simple to run though, judging from the issues.

You can also base your network on theirs, to answer your questions.

AviZ · December 18, 2024, 12:37pm

I will give it a try.
My purpose is to learn by doing, so I’d appreciate ideas.

Like what guidelines make a detector efficient?
What loss, any special data preparation?
How to take advantage of the fact that only a single class is needed?

Aknw_Fen · December 18, 2024, 1:24pm

Like what guidelines make a detector efficient?

In computer vision, an efficient model tends to be built from blocks of Conv2d -> BatchNorm -> Activation -> Dropout2d. These blocks learn filters (like these). Optimisation comes later.
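As a rough illustration of the block described above, here is a minimal PyTorch sketch. The class name `ConvBlock`, the 3x3 kernel, the ReLU activation and the dropout rate are all assumptions made for the example, not something prescribed in this thread.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Conv2d -> BatchNorm2d -> Activation -> Dropout2d (hypothetical helper)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, p_drop: float = 0.1):
        super().__init__()
        self.block = nn.Sequential(
            # The conv bias is dropped because BatchNorm adds its own shift.
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


block = ConvBlock(3, 16, stride=2)
out = block(torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

With stride 2 the block halves the height and width while the channel count goes from 3 to 16.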

To keep the model size manageable, you may want to inspect it with torchinfo.summary.

The goal is to move information efficiently from the spatial dimensions into the channels, reducing H and W while increasing C. This normally involves hyperparameter tuning.
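To make the "shrink H and W, grow C" idea concrete, here is a small sketch of a four-stage backbone. The channel widths (16, 32, 64, 128), the input size and the helper name `conv_bn_act` are assumptions for illustration. The parameter count is computed in plain PyTorch so the sketch has no extra dependency; torchinfo.summary would give a more detailed per-layer report.

```python
import torch
from torch import nn


def conv_bn_act(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """Hypothetical helper: one strided 3x3 conv stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


# Each stage halves H and W and (after the first) doubles C.
backbone = nn.Sequential(
    conv_bn_act(3, 16, 2),
    conv_bn_act(16, 32, 2),
    conv_bn_act(32, 64, 2),
    conv_bn_act(64, 128, 2),
)

backbone.eval()
x = torch.randn(1, 3, 128, 128)
shapes = []
with torch.no_grad():
    for stage in backbone:
        x = stage(x)
        shapes.append(tuple(x.shape))
        print(tuple(x.shape))

# Total number of learnable parameters, a quick proxy for model size.
n_params = sum(p.numel() for p in backbone.parameters())
print(n_params)
```

The printed shapes show the spatial size dropping from 64x64 to 8x8 while the channels grow from 16 to 128.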

For the head, you want to end with Flatten -> Linear -> Activation. For single-class prediction you can use either one unit with a sigmoid, or two units with a softmax. But see this cool post about which one is best (tl;dr: use sigmoid).

Another option is using a global max pool layer instead of the linear layer.
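A minimal sketch of such a head, assuming a backbone that outputs 128 channels. This variant uses the global max pool in place of Flatten and keeps a small Linear layer on top, which is one possible reading of the suggestion above; the name `PresenceHead` is made up for the example.

```python
import torch
from torch import nn


class PresenceHead(nn.Module):
    """Global max pool -> Linear -> one logit (hypothetical single-class head)."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)  # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Linear(in_ch, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feats).flatten(1)  # (N, C)
        return self.fc(pooled).squeeze(1)     # (N,) raw logits


head = PresenceHead(128)
logits = head(torch.randn(4, 128, 8, 8))
probs = torch.sigmoid(logits)  # probability that the object is present
print(probs.shape)
```

Returning raw logits and applying the sigmoid outside the module keeps the head usable with losses that expect logits.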

It’s also common to flatten with x.view(-1, *size[1:]) instead of using a layer. I’m not fully sure why not x.view(*size), but anyhow, that’s what I see sometimes.
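For comparison, here is a small sketch of three flattening idioms that collapse everything except the batch dimension. The form `x.view(x.size(0), -1)` is an assumption about which idiom is meant here; the tensor shape is arbitrary.

```python
import torch
from torch import nn

x = torch.randn(4, 8, 5, 5)

a = x.view(x.size(0), -1)          # keep the batch dim, infer the rest
b = torch.flatten(x, start_dim=1)  # functional equivalent
c = nn.Flatten()(x)                # layer form, handy inside nn.Sequential

print(a.shape)  # torch.Size([4, 200])
same = torch.equal(a, b) and torch.equal(b, c)
```

All three produce the same (N, C*H*W) tensor, so the choice is mostly a matter of style.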

What loss, any special data preparation?

For a classification loss, MSE is often enough.
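As a sketch of what that looks like in PyTorch, the snippet below computes MSE on sigmoid outputs and, for comparison, binary cross-entropy on the raw logits, which is another common choice for a single sigmoid unit. The logits and targets are made-up values.

```python
import torch
from torch import nn

# Made-up raw model outputs and 0/1 labels for three images.
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# MSE is applied to probabilities, so the sigmoid comes first.
mse = nn.MSELoss()(torch.sigmoid(logits), targets)

# BCEWithLogitsLoss applies the sigmoid internally.
bce = nn.BCEWithLogitsLoss()(logits, targets)

print(mse.item(), bce.item())
```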

How to take advantage of the fact that only a single class is needed?

Look at the link I provided last.

You may also want to use a random set of images that are common in your context but do not contain the class, to teach “not an instance of the class”. So if you want to detect dogs in a zoo, passing other animals labeled as “not a dog” may help.

AviZ · December 18, 2024, 1:50pm

Great tips.

What about building the model?
I’d like to avoid using anchors.
So basically 2-3 scales, each outputting a tensor of shape (h x w x 5), where the 3rd dimension holds:

Probability of detection (Classification).
(x, y) - shift from the top left of the pixel for the center of the box.
(width, height) - normalized width and height of the box.

I assume data is labeled in YOLO format.
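One way to turn such a label into a training target for a single scale is sketched below. It assumes the YOLO label has already been reduced to `(x_center, y_center, width, height)`, all normalized to [0, 1], and that `None` means the image contains no object; the function name `encode_target` and the 8x8 grid are hypothetical.

```python
import torch


def encode_target(label, grid_h: int, grid_w: int) -> torch.Tensor:
    """Build an (grid_h, grid_w, 5) target: [objectness, dx, dy, w, h]."""
    target = torch.zeros(grid_h, grid_w, 5)
    if label is None:  # no object in this image
        return target
    xc, yc, w, h = label
    # Grid cell that contains the box center.
    col = min(int(xc * grid_w), grid_w - 1)
    row = min(int(yc * grid_h), grid_h - 1)
    target[row, col, 0] = 1.0
    # Offset of the center from the cell's top-left corner, in cell units.
    target[row, col, 1] = xc * grid_w - col
    target[row, col, 2] = yc * grid_h - row
    # Normalized box size.
    target[row, col, 3] = w
    target[row, col, 4] = h
    return target


t = encode_target((0.55, 0.30, 0.20, 0.40), grid_h=8, grid_w=8)
print(t[2, 4])
```

Since the object appears at most once, at most one cell per scale is marked positive.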
Now, what I don’t get is how to implement the loss.
How does the loss interact with the NMS that many models apply?
Should I apply the regression loss only in the cases where the model has a high probability of detection?
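On the NMS part: since the object appears at most once, one possible simplification at inference time is to take the single highest-scoring cell instead of running NMS. The sketch below assumes a raw (h, w, 5) output for one scale, laid out as [objectness, dx, dy, w, h], with a sigmoid applied to every channel; the function name `decode_single` and the 0.5 threshold are assumptions.

```python
import torch


def decode_single(pred: torch.Tensor, threshold: float = 0.5):
    """Return (xc, yc, w, h, score) for the best cell, or None if below threshold."""
    grid_h, grid_w, _ = pred.shape
    act = torch.sigmoid(pred)
    scores = act[..., 0]
    # argmax without a dim returns an index into the flattened grid.
    row, col = divmod(int(torch.argmax(scores)), grid_w)
    score = float(scores[row, col])
    if score < threshold:
        return None
    dx, dy, w, h = act[row, col, 1:].tolist()
    xc = (col + dx) / grid_w  # back to normalized image coordinates
    yc = (row + dy) / grid_h
    return xc, yc, w, h, score


pred = torch.full((8, 8, 5), -10.0)
pred[2, 4] = torch.tensor([5.0, 0.0, 0.0, 0.0, 0.0])
box = decode_single(pred)
print(box)
```

With several scales, the same idea would apply to the best cell across all of them.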

Aknw_Fen · December 18, 2024, 5:00pm

I actually replied about the loss for a classification task, not for single object detection as in finding the bounding box.

For that, you can search for tutorials. I don’t quite remember it right now, but as a guess I think it involves several losses, for example one for the 4 coordinates and another for the classification. The original YOLO actually had a pretty complicated loss.
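Along the lines of that guess, here is a hedged sketch of a two-part loss for a grid output of shape (N, h, w, 5) laid out as [objectness, dx, dy, w, h]. It is not the original YOLO loss. The objectness term uses every cell, while the box term uses only the cells the ground truth marks as positive, which is one common way to avoid gating on the model's own predicted probability. The name `detection_loss` and the `box_weight` value are assumptions.

```python
import torch
import torch.nn.functional as F


def detection_loss(pred: torch.Tensor, target: torch.Tensor, box_weight: float = 5.0):
    """pred: raw outputs (N, H, W, 5); target: same shape, values in [0, 1]."""
    # Classification term over all cells.
    obj_loss = F.binary_cross_entropy_with_logits(pred[..., 0], target[..., 0])

    # Regression term only where the ground truth contains the object.
    pos = target[..., 0] > 0.5
    if pos.any():
        box_pred = torch.sigmoid(pred[..., 1:][pos])  # (P, 4)
        box_loss = F.mse_loss(box_pred, target[..., 1:][pos])
    else:
        box_loss = pred.sum() * 0.0  # keeps the graph valid for empty images

    return obj_loss + box_weight * box_loss


target = torch.zeros(2, 4, 4, 5)
target[0, 1, 2] = torch.tensor([1.0, 0.4, 0.4, 0.2, 0.3])
pred = torch.zeros(2, 4, 4, 5, requires_grad=True)
loss = detection_loss(pred, target)
loss.backward()
print(loss.item())
```

Images without the object contribute only to the objectness term, so they still teach the model to stay silent.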