I have seen people write the reconstruction loss in two different ways:
`F.binary_cross_entropy(recon_x1, x1.view(-1, 784))` or `F.binary_cross_entropy(recon_x1, x1.view(-1, 784), reduction="sum")`
I was wondering if there is a theoretical reason to use one over the other?
Hi @Rojin
I believe this comes from the fact that the ELBO's reconstruction term is a sum (or an integral) over the output dimensions.
So the sum reduction is the more paper-faithful approach. The PyTorch default (mean) will still work, but note that it divides by batch_size * 784, which downscales the reconstruction term and its gradients by a constant factor. If the KL term is not rescaled accordingly, this shifts the balance between the two terms (effectively a different beta weighting).
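A quick sketch of how the two reductions relate, using random tensors as stand-ins for a real batch and decoder output (the shapes and names here are illustrative, not from your code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 784)      # stand-in for a batch of flattened 28x28 images
recon = torch.rand(8, 784)  # stand-in for decoder output in (0, 1)

bce_sum = F.binary_cross_entropy(recon, x, reduction="sum")
bce_mean = F.binary_cross_entropy(recon, x, reduction="mean")

# "mean" divides by every element (batch_size * 784), so the two
# reductions differ by the constant factor x.numel():
print(torch.isclose(bce_sum, bce_mean * x.numel()))
```

So if you use the mean reduction and want the same loss balance as in the paper, you would divide the KL term by the same factor (or multiply the BCE back up).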