BatchNorm2D folding/fusion

I have been trying to improve the inference performance of a ResNet-50-based model for deployment. After scripting the pretrained model for JIT compilation and finding no improvement from it, I am now also testing fusion of the BatchNorm layers into the preceding convolutions. The fusion itself succeeds (verified by printing the model before and after: all 50 BatchNorm layers are folded away), but I see no speedup from that either. This is on PyTorch 1.7.1 / CUDA 11.2.

It is making me wonder whether PyTorch is already optimizing things so well that these tricks are no longer effective or necessary. Has anyone else experienced this?
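For reference, this is roughly the folding I am doing. It is a minimal sketch of the standard Conv+BN algebra (fold `gamma / sqrt(running_var + eps)` into the conv weights and the shifted mean into the bias), not the exact code from my model, and it only holds in eval mode where BatchNorm uses its running statistics:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d's running statistics into the preceding Conv2d.

    Valid only in eval mode, where BN applies
    y = gamma * (x - running_mean) / sqrt(running_var + eps) + beta.
    """
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = (conv.bias.data if conv.bias is not None
                 else torch.zeros(conv.out_channels))
    # New bias: (b - running_mean) * scale + beta
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Sanity check: fused conv matches conv -> bn in eval mode.
conv = nn.Conv2d(3, 8, kernel_size=3, bias=False).eval()
bn = nn.BatchNorm2d(8).eval()
bn.running_mean.uniform_(-1.0, 1.0)   # give BN non-trivial statistics
bn.running_var.uniform_(0.5, 2.0)
fused = fuse_conv_bn(conv, bn)
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-4))
```

Numerically the outputs match, so the fusion is correct; it just does not change the wall-clock time on my setup. (PyTorch also ships `torch.nn.utils.fusion.fuse_conv_bn_eval`, which does essentially the same thing.)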