I want to build an object detector.
The YOLO models are too large for my use case.
In my use case there is only a single type of object to detect.
It is also guaranteed to appear once at most.
Hence I thought I could come up with a very simple and efficient model.
I was wondering, for such a case, how would you design the model?
Would you choose a YOLO-style model with anchors? Which convolutions?
Which loss would take advantage of the fact that there is only a single class?
I’d be happy if people shared their intuition for this specific use case.
In computer vision, such a model tends to be built from blocks of Conv2d -> BatchNorm -> Activation -> Dropout2d, which learn filters (like these). Optimisation comes later.
The idea is to move information efficiently from the spatial dimensions into the channels, reducing H and W while increasing C. This normally involves hyperparameter tuning.
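To make the block structure concrete, here is a minimal PyTorch sketch. The channel widths (16, 32, 64), the dropout rate, the ReLU activation and the use of stride 2 for downsampling are my own assumptions for illustration, not tuned values.

```python
import torch
from torch import nn


def conv_block(c_in: int, c_out: int, p_drop: float = 0.1) -> nn.Sequential:
    """Conv2d -> BatchNorm -> Activation -> Dropout2d, halving H and W."""
    return nn.Sequential(
        # stride=2 halves the spatial size; the conv bias is redundant before BatchNorm
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.Dropout2d(p_drop),
    )


# Hypothetical widths: channels grow while the spatial size shrinks.
backbone = nn.Sequential(
    conv_block(3, 16),
    conv_block(16, 32),
    conv_block(32, 64),
)

x = torch.randn(2, 3, 64, 64)   # batch of two 64x64 RGB images
features = backbone(x)
print(features.shape)           # torch.Size([2, 64, 8, 8])
```

Each block trades a factor of two in H and W for more channels, which is the "reduce H-W, increase C" pattern described above.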
For the head, you want to end with Flatten -> Linear -> Activation. For single-class prediction you can use one unit with a sigmoid, or two units with a softmax. But see this cool post about which one is best (tl;dr: use sigmoid).
Another option is using a global max pool layer instead of the linear.
It’s also common to flatten with x.view(-1, *size[1:]) instead of using a layer. I'm not fully sure why it isn't x.view(*size), but anyhow, that’s what I see sometimes.
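A rough sketch of the two head options, assuming a backbone that outputs a (N, 64, 8, 8) feature map. The class names BinaryHead and PooledHead are made up for this example, and I flatten with x.view(x.size(0), -1), which is the variant I am sure about rather than the exact expression quoted above.

```python
import torch
from torch import nn


class BinaryHead(nn.Module):
    """Flatten -> Linear, producing one logit per image."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.fc = nn.Linear(channels * height * width, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        flat = features.view(features.size(0), -1)  # keep batch dim, flatten the rest
        return self.fc(flat).squeeze(1)             # raw logits, shape (N,)


class PooledHead(nn.Module):
    """1x1 conv to a single channel, then global max pool (no Linear layer)."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        self.pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.pool(self.score(features)).flatten(1).squeeze(1)  # shape (N,)


features = torch.randn(4, 64, 8, 8)        # stand-in for backbone output
linear_logits = BinaryHead(64, 8, 8)(features)
pooled_logits = PooledHead(64)(features)
probs = torch.sigmoid(linear_logits)       # probability of "object present"
```

Both heads return logits; the sigmoid is applied only when a probability is needed, so that training can use nn.BCEWithLogitsLoss directly on the logits.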
What loss would you use? Any special data preparation?
For a classification loss, MSE is often enough, although binary cross-entropy is the more usual choice with a sigmoid output.
How can I take advantage of the fact that only a single class is needed?
Look at the link I provided last.
You may also want to include a random set of images that are common in your context but do not contain the class, to teach “not an instance of the class”. So if you want to detect dogs in a zoo, passing images of other animals labeled as “not a dog” may help.
What about building the model?
I’d like to avoid using anchors.
So basically 2-3 scales, each outputting a tensor of shape (h x w x 5), where the third dimension holds:
Probability of detection (Classification).
(x, y) - offset of the box center from the top-left corner of the pixel (grid cell).
(width, height) - normalized width and height of the box.
I assume data is labeled in YOLO format.
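To make the layout concrete, here is a sketch in plain Python of how one YOLO-format label (class, x_center, y_center, width, height, all normalized to [0, 1]) could map to a single-scale target, and how a prediction could be read back. The function names and the 0.5 threshold are my own placeholders. Since at most one object can appear, the decoder simply takes the most confident cell.

```python
def encode_target(label, grid_h, grid_w):
    """Turn one YOLO-format label (or None) into a grid_h x grid_w x 5 target."""
    target = [[[0.0] * 5 for _ in range(grid_w)] for _ in range(grid_h)]
    if label is None:                       # image without the object
        return target
    _cls, xc, yc, w, h = label              # coordinates normalized to [0, 1]
    col = min(int(xc * grid_w), grid_w - 1)
    row = min(int(yc * grid_h), grid_h - 1)
    # [objectness, x offset in cell, y offset in cell, width, height]
    target[row][col] = [1.0, xc * grid_w - col, yc * grid_h - row, w, h]
    return target


def decode_prediction(pred, threshold=0.5):
    """Return (xc, yc, w, h, score) for the most confident cell, or None."""
    grid_h, grid_w = len(pred), len(pred[0])
    best = None
    for row in range(grid_h):
        for col in range(grid_w):
            if best is None or pred[row][col][0] > pred[best[0]][best[1]][0]:
                best = (row, col)
    row, col = best
    score, dx, dy, w, h = pred[row][col]
    if score < threshold:                   # nothing confident enough: no object
        return None
    return ((col + dx) / grid_w, (row + dy) / grid_h, w, h, score)


target = encode_target((0, 0.55, 0.30, 0.2, 0.1), grid_h=4, grid_w=4)
box = decode_prediction(target)             # round trip recovers the label
empty = decode_prediction(encode_target(None, 4, 4))
```

With several scales, the same encoding would be repeated per grid size; how to assign the object to one scale or to all of them is a design choice this sketch leaves open.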
Now, what I don’t get is how to implement the loss.
How does the loss interact with the NMS that many models apply?
Should I apply the regression loss only in the cases where the model has a high probability of detection?
I actually replied about the loss for a classification task, not for single-object detection as in finding the bounding box.
For that, you can search for tutorials. I don’t quite remember it right now, but as a guess I think it involves several losses, for example one for the 4 coordinates and another for the classification. The original YOLO actually had a pretty complicated loss.
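As a rough sketch of that "several losses" idea for an (N, H, W, 5) output, something like the following could work. This is a simplification and not the original YOLO loss: the function name, the sigmoid on the box channels, the MSE box term and the weight of 5.0 are all assumptions for illustration. The regression term is masked by the ground-truth cell rather than by the predicted probability, and NMS does not appear because it is an inference-time post-processing step.

```python
import torch
import torch.nn.functional as F


def single_object_loss(pred, target, box_weight=5.0):
    """pred, target: (N, H, W, 5), channels [objectness, x, y, w, h].

    pred holds raw network outputs; target holds values in [0, 1].
    """
    obj_target = target[..., 0]
    # Classification term over every cell, with or without an object.
    obj_loss = F.binary_cross_entropy_with_logits(pred[..., 0], obj_target)

    # Regression term only on cells that hold an object in the ground truth.
    pos = obj_target > 0.5
    if pos.any():
        box_pred = torch.sigmoid(pred[..., 1:][pos])   # squash to [0, 1] like the targets
        box_loss = F.mse_loss(box_pred, target[..., 1:][pos])
    else:
        box_loss = pred.new_zeros(())                  # no object anywhere in the batch
    return obj_loss + box_weight * box_loss


pred = torch.randn(2, 4, 4, 5, requires_grad=True)
target = torch.zeros(2, 4, 4, 5)
target[0, 1, 2] = torch.tensor([1.0, 0.2, 0.2, 0.3, 0.4])  # image 0 has an object, image 1 is empty
loss = single_object_loss(pred, target)
loss.backward()
```

Empty images still contribute through the classification term, which pushes every cell towards "no object".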