VGG with or without batch norm behaves extremely differently

I'm working on a binary segmentation task, with the background as zero and the foreground as one. I chose VGG16 with or without batch normalization as my network backbone; both are pretrained on ImageNet. I use a single image to overfit the network as a sanity check. The result is weird and I can't figure it out.
The version with batch norm is able to produce a segmentation map, but the version without produces an all-zero prediction.
Does that mean the version without batch norm is not transferable?
Why does this happen? I use weighted cross entropy as the loss function and RandomResizedCrop as the data augmentation method. I tried removing the class weights and the augmentation, but it's no better. I'm confused.
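For reference, the loss setup is roughly like this (a minimal sketch; the class weights and tensor shapes below are placeholder values for illustration, not my exact ones):

```python
import torch
import torch.nn as nn

# Placeholder class weights: upweight the foreground class (assumed values).
class_weights = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(1, 2, 224, 224)          # (N, C, H, W) raw per-pixel scores
target = torch.randint(0, 2, (1, 224, 224))   # (N, H, W) binary ground-truth mask
loss = criterion(logits, target)              # weighted per-pixel cross entropy
```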
Network Detail:
My network is simple: the encoder is VGG16, and I use bilinear upsampling, a Conv2d(512, 2) layer, and a Sigmoid to turn the features into a segmentation map directly. This setup has always worked well in this kind of single-image overfitting test.
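Roughly, the model looks like the sketch below (a simplified reconstruction assuming torchvision's VGG16; the class name and exact layer placement are for illustration, not my exact code):

```python
import torch
import torch.nn as nn
from torchvision import models

class VggSeg(nn.Module):
    """Sketch: VGG16 encoder + bilinear upsample + Conv2d(512, 2) + Sigmoid."""
    def __init__(self, batch_norm=True):
        super().__init__()
        vgg = models.vgg16_bn(pretrained=True) if batch_norm else models.vgg16(pretrained=True)
        self.encoder = vgg.features                       # 512 channels at 1/32 resolution
        self.head = nn.Conv2d(512, 2, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feat = self.encoder(x)                            # (N, 512, h/32, w/32)
        feat = nn.functional.interpolate(
            feat, size=(h, w), mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(feat))             # (N, 2, h, w) segmentation map
```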
The version with batch norm produces a coarse result, but at least it predicts something positive; the version without produces only black.

On the other hand, the loss stays constant for the version without batch norm, while for the version with batch norm it decreases and then plateaus.

Could you try playing around with some hyperparameters (learning rate, weight initialization, etc.)?
Your model without BatchNorm layers might be more sensitive to the input distribution.
Are you normalizing the image?
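For example, something along these lines (assuming torchvision transforms and the standard ImageNet statistics the pretrained VGG weights expect; the crop size is just an example):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),                # example crop size
    transforms.ToTensor(),                            # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics used
                         std=[0.229, 0.224, 0.225]),  # for the pretrained weights
])
```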

Did you solve it? I have the same problem using a simpler VGG on MNIST, so the background is black and the digits are white.