Optimizing Non-Maximum Suppression with Hybrid CUDA Kernels

I implemented a hybrid NMS kernel that outperforms the baseline when the number of boxes exceeds 2,000. For smaller inputs, however, it is slightly slower than the baseline. I am deciding whether to keep the hybrid design or to simplify the implementation down to a single kernel. Does anyone have guidance on this trade-off?
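
To make the "keep the hybrid" option concrete, here is a minimal sketch of dispatching on box count, so that small inputs stay on the baseline path and only large inputs take the hybrid path. It is written in Python on the CPU purely for illustration. The names `nms_reference`, `make_dispatcher`, and `HYBRID_THRESHOLD` are hypothetical and not part of my actual code, and the 2,000 cutoff is just the figure mentioned above, which would need to be measured per GPU.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
NmsFn = Callable[[Sequence[Box], Sequence[float], float], List[int]]

# Assumption: crossover point taken from the post; tune it by benchmarking.
HYBRID_THRESHOLD = 2000


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes: Sequence[Box], scores: Sequence[float],
                  iou_threshold: float) -> List[int]:
    """Greedy CPU NMS, used here as a stand-in for a real kernel launch."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def make_dispatcher(small_path: NmsFn, large_path: NmsFn,
                    threshold: int = HYBRID_THRESHOLD) -> NmsFn:
    """Return an NMS function that picks a path based on the box count."""
    def dispatch(boxes: Sequence[Box], scores: Sequence[float],
                 iou_threshold: float) -> List[int]:
        path = large_path if len(boxes) > threshold else small_path
        return path(boxes, scores, iou_threshold)
    return dispatch


# Both paths are wired to the same reference here; in practice they would
# wrap the baseline kernel and the hybrid kernel respectively.
nms = make_dispatcher(small_path=nms_reference, large_path=nms_reference)
```

With a dispatcher like this, the small-input regression would disappear at the cost of maintaining two code paths, which is exactly the trade-off I am unsure about.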