Better model than CNN and Attention for image object detection?

Way_New · October 6, 2025, 12:27pm

I have a set of images with corresponding annotations. Under some image transforms the labels stay the same.
How can I design a model that is both accurate and fast?
The current model combines CNN and Attention layers and is trained by gradient descent.
I have some experience with UNets built from Conv(kernel=3, padding=1), MaxPool(kernel=2, stride=2) and upsampling fusion. That works better than one conv plus one Mamba linear state space layer, and it is not much slower.
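
To make the UNet layer settings above concrete, here is a rough sketch of how the spatial size moves through such a network. It is only shape arithmetic in plain Python, not the actual model. The function names (`conv_out`, `pool_out`, `unet_shapes`), the x2 upsampling factor, and the one-conv-per-stage layout are assumptions made for illustration.

```python
def conv_out(n, kernel=3, padding=1, stride=1):
    """Output length of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, kernel=2, stride=2):
    """Output length of an unpadded max-pool along one spatial axis."""
    return (n - kernel) // stride + 1


def unet_shapes(size, depth):
    """Trace the spatial size through `depth` encoder stages and back up.

    Returns (encoder_sizes, decoder_sizes). Raises ValueError when an
    upsampled map would not match the skip connection it is fused with.
    """
    encoder = []
    n = size
    for _ in range(depth):
        n = conv_out(n)      # Conv(kernel=3, padding=1) keeps the size
        encoder.append(n)    # kept for the skip connection
        n = pool_out(n)      # MaxPool(kernel=2, stride=2) halves the size
    n = conv_out(n)          # bottleneck conv
    decoder = []
    for skip in reversed(encoder):
        n = n * 2            # assumed: x2 upsampling before fusion
        if n != skip:
            raise ValueError(f"upsampled size {n} != skip size {skip}")
        n = conv_out(n)      # conv applied after fusing with the skip map
        decoder.append(n)
    return encoder, decoder


print(unet_shapes(64, 3))  # ([64, 32, 16], [16, 32, 64])
```

With these settings the input side has to be divisible by 2^depth for the fusion to line up without cropping or padding. For example, `unet_shapes(100, 3)` raises `ValueError`, because 25 pools down to 12 and upsamples back to 24.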