I have seen people writing the reconstruction loss in two different ways:

`F.binary_cross_entropy(recon_x1, x1.view(-1, 784))` or `F.binary_cross_entropy(recon_x1, x1.view(-1, 784), reduction="sum")`

I was wondering if there is a theoretical reason to use one over the other?

Hi @Rojin

I believe this comes from the fact that the ELBO's reconstruction term is a log-likelihood that factorizes as a sum over pixels, and the KL divergence is likewise an integral/sum.

So the sum reduction is the more paper-faithful approach. The PyTorch default (mean) will still work, but it divides the reconstruction term by the number of elements, which downscales its gradients and also shrinks its weight relative to the KL term, so you may need to rescale the KL term (or the learning rate) to compensate.
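To make the scaling concrete, here is a minimal sketch (with made-up tensors standing in for `recon_x1` and `x1`) showing that the two reductions differ only by a constant factor, the total number of elements:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x1 = torch.rand(8, 1, 28, 28)   # hypothetical batch of MNIST-sized inputs
recon_x1 = torch.rand(8, 784)   # hypothetical decoder output in [0, 1]

target = x1.view(-1, 784)

# Default reduction: mean over all elements in the batch
loss_mean = F.binary_cross_entropy(recon_x1, target)

# Paper-faithful reduction: sum over pixels and batch items
loss_sum = F.binary_cross_entropy(recon_x1, target, reduction="sum")

# They differ only by the element count, so gradients are
# uniformly downscaled by that factor under the mean reduction.
print(torch.allclose(loss_sum, loss_mean * target.numel()))
```

Because the factor is constant, the optimum is unchanged; what changes is the effective step size and the balance against any unscaled KL term you add to it.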