Input normalization in a VAE

I am training a special case of a VAE on MNIST plus an additional dataset. Whenever I scale the inputs so that they stay in the [0, 1] range, training fails miserably: the decoder collapses to generating the same image for every input. Standardizing the data with the mean and standard deviation of the respective dataset, on the other hand, leads to relatively rapid training progress.

This surprises me, because the last layer of my decoder is a sigmoid, which maps to (0, 1) rather than to the whole real line, which is where standardized data can in principle live. The way I understand VAEs, the encoder and decoder should “mirror” each other’s structure, but in my case there is a clear mismatch between the range of the inputs and the range the decoder can actually generate. I am using binary cross-entropy (BCELoss) for the reconstruction loss.

Am I missing something fairly obvious here?
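
For concreteness, here is a stripped-down sketch of the relevant pieces. The layer sizes, the 0.1307/0.3081 statistics (the usual MNIST numbers), and the variable names are just placeholders, not my exact setup:

```python
import torch
import torch.nn as nn

# Stand-in for a batch of flattened 28x28 images scaled to [0, 1]
# (e.g. what torchvision's ToTensor() gives for MNIST).
x_unit = torch.rand(8, 784)

# Stand-in for the same batch standardized with the dataset mean/std.
# Pixel value 0 maps to about -0.42 and 1 to about 2.82, so the targets
# are no longer confined to [0, 1].
x_std = (x_unit - 0.1307) / 0.3081

# Decoder head ending in a sigmoid: reconstructions live in (0, 1).
decoder_head = nn.Sequential(
    nn.Linear(20, 400), nn.ReLU(),
    nn.Linear(400, 784),
    nn.Sigmoid(),
)

recon_loss = nn.BCELoss(reduction="sum")

z = torch.randn(8, 20)
x_hat = decoder_head(z)

# Case A: targets in [0, 1] -- the setting BCELoss is defined for.
loss_unit = recon_loss(x_hat, x_unit)

# Case B: standardized targets fall outside [0, 1], so the BCE term no
# longer corresponds to a Bernoulli log-likelihood, and the sigmoid
# output can never reach the target range.
# loss_std = recon_loss(x_hat, x_std)
```

Case B (standardized targets) is the configuration that actually trains well for me, which is exactly the part I find confusing.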