I have been working on a binary segmentation task for months and have tried everything. My architecture could overfit a small dataset but never converged on the larger one. I tried a pretrained encoder, tuned other hyper-parameters, and used learning-rate schedulers, but none of it worked. Today I tried normalizing the output with mean 0 and std 0.5 and it worked, although normalizing with mean 0.5 and std 0.5 does not work and the loss starts diverging. I wanted to ask why normalization has such a drastic effect, and how I can find these values for other problems in the future.
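For concreteness, and assuming the normalization here is the standard `(x - mean) / std` rescaling (torchvision-style `Normalize`) applied to data in [0, 1], a minimal sketch of what the two variants actually do to the value range:

```python
import numpy as np

# Toy batch in [0, 1]; values are illustrative, not the actual data.
x = np.random.default_rng(0).random((4, 1, 32, 32)).astype(np.float32)

# Variant that worked: mean=0, std=0.5 rescales [0, 1] -> [0, 2]
# (values stay non-negative, mean stays positive).
x_works = (x - 0.0) / 0.5

# Variant that diverged: mean=0.5, std=0.5 rescales [0, 1] -> [-1, 1]
# (roughly zero-centered).
x_diverges = (x - 0.5) / 0.5

print(x_works.min(), x_works.max())      # within [0, 2]
print(x_diverges.min(), x_diverges.max())  # within [-1, 1]
```

So the two choices differ only in an affine shift and scale, but they put the values into quite different ranges, which is what the answer below is about.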
That’s because the usual weight initializers try to preserve mean and variance, but that is only possible if the input moments are known, so they assume normalized inputs (mean 0, std 1). With input mean != 0, depending on the layer and weight shapes, a layer’s output mean can be pushed far from zero and its variance can grow, so you can run into a bunch of well-known problems (saturation, flat loss regions, etc.).
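A small sketch of this effect (my numbers, chosen for illustration): push a zero-mean batch and a mean-5 batch through the same He-initialized linear layer and compare the output statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512

# He-style init: zero-mean weights with variance 2 / fan_in,
# which preserves moments only for zero-mean inputs.
w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_in))

x_centered = rng.normal(0.0, 1.0, size=(1000, fan_in))  # mean 0, std 1
x_shifted = rng.normal(5.0, 1.0, size=(1000, fan_in))   # mean 5, std 1

y_centered = x_centered @ w
y_shifted = x_shifted @ w

# Centered input: per-unit output means stay near 0, overall std ~ sqrt(2).
print(y_centered.mean(axis=0).std(), y_centered.std())

# Shifted input: each unit's mean is pushed to a large random offset
# (each a sum of 5 * w_ij over fan_in), so the overall spread blows up.
print(y_shifted.mean(axis=0).std(), y_shifted.std())
```

With the shifted input the per-unit means scatter far from zero, exactly the regime where downstream nonlinearities saturate.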
If you mean that normalizing the final output helped you, that again means the network’s outputs were in a bad region before normalization (e.g., a mean around 100).
What can I do so that these problems occur less often, and how can I know which values of mean and standard deviation will work?
Hidden layers usually work best with mean = 0 and std = 1. The final output shouldn’t be normalized (at least not without affine=True), since your loss target should dictate the best mean.
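The affine=True point can be sketched with a hypothetical helper: normalize to (0, 1) moments, then let learnable gamma/beta parameters shift the output back to whatever range the loss target actually needs (so normalization doesn’t fight the target).

```python
import numpy as np

def affine_norm(x, gamma, beta, eps=1e-5):
    # Normalize to mean 0 / std 1, then apply a learnable affine
    # transform (gamma, beta) so training can recover the right range.
    x_hat = (x - x.mean()) / (x.std() + eps)
    return gamma * x_hat + beta

# Outputs stuck in a bad region (mean ~ 100), as in the answer above.
raw = np.random.default_rng(1).normal(100.0, 3.0, size=1000)

# With gamma=1, beta=0 this is plain normalization; in a real layer
# gamma and beta would be trained alongside the other weights.
out = affine_norm(raw, gamma=1.0, beta=0.0)
print(out.mean(), out.std())  # close to 0 and 1
```

Without the affine parameters, normalizing the final output would pin it to mean 0 / std 1 regardless of what the targets look like.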
What about the pixel values of the output? Should they be in the range 0 to 1 or -1 to 1?
Inside the network you have a pixels -> features -> features -> predictors pipeline, so there is no reason to stick to the 0…1 range.
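Assuming the binary segmentation head ends in a sigmoid (the usual setup, though not stated in the thread), this is why intermediate values needn’t be in [0, 1]: the raw logits can live in any range, and only the final sigmoid maps them to per-pixel probabilities.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued logit into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Raw logits from a final conv layer can be arbitrarily large or small;
# the sigmoid (or a BCE-with-logits loss) handles the 0..1 mapping.
logits = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(logits)
print(probs)  # each value strictly between 0 and 1
```

So the 0…1 constraint only matters at the very end, where the loss and the sigmoid impose it anyway.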