Optimization vs architecture

Which factor matters more for achieving SOTA results with object detectors: optimization or architecture? For example, detectors like YOLO and EfficientDet achieved their reported results with the specific optimization settings used in their respective papers. Could better results be achieved solely by changing the optimization while keeping the same architecture?

I am wondering whether research in that area is worth the effort. My intuition tells me that it all comes down to an optimization problem, but I would like outside input.