What could be the reasons for BatchNorm working and Dropout not working in a YoloV1 PyTorch implementation?

Hi. I tried to implement YoloV1 in PyTorch from the original paper. Eventually, I got into a situation where the model trained for a while and the loss decreased, but then got stuck at values somewhere between the starting point and 0.

I then found Aladdin Persson's implementation (which he describes in a YouTube video; the specific moment regarding BatchNorm is here). He says the original paper used Dropout because BatchNorm had not been invented at the time, and that he wanted to use BatchNorm instead. I thought there was no critical difference between the two and decided to stick with the paper for the sake of learning to implement such things.

But my model still couldn't even overfit. After some investigation, I figured out that the only real difference between my implementation and his is the use of BatchNorm: if I replace BatchNorm with any Dropout greater than 0.0 in the CNNBlock, the model starts learning worse (see the sketch below). It converges later than normal if the Dropout p is 0.1-0.2 and stops converging at all if p is greater than 0.2.
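To make the comparison concrete, here is roughly what I mean by the two variants of the block. This is a simplified sketch: the exact layer order, kernel arguments, and the choice of `Dropout2d` are my simplifications, not the exact code from either implementation.

```python
import torch.nn as nn

class CNNBlockBN(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU: the variant that converges for me."""
    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CNNBlockDropout(nn.Module):
    """Conv -> LeakyReLU -> Dropout: the paper-style variant that learns worse for p > 0."""
    def __init__(self, in_channels, out_channels, p=0.2, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **kwargs)
        self.act = nn.LeakyReLU(0.1)
        self.dropout = nn.Dropout2d(p)  # p=0.0 effectively disables it

    def forward(self, x):
        return self.dropout(self.act(self.conv(x)))
```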

Could someone please explain how this works? Should I never use Dropout because BatchNorm is better, or do both options have pros and cons?

Specifically for YoloV1: maybe Dropout only worked in their particular training pipeline, with a lot of data and pre-training?