I’m used to reading papers and seeing code (like https://github.com/pytorch/examples/tree/master/vae) that minimizes the KL divergence via the ELBO, but recently I looked at an implementation of neural processes that does the same thing, and it just uses
kl_divergence from PyTorch (https://github.com/EmilienDupont/neural-processes/blob/master/training.py#L130)
I took a look at the source code and docs, and there wasn’t much information there beyond the basics of what KL divergence is. It looks like the source code might be sampling the distribution somehow…
Does computing the loss with kl_divergence and then backpropagating it remove the need to derive the ELBO and reparameterize altogether?
I’d like to know the answer as well. One thing that pops up is that VAEs use reparameterisation to prevent the autoencoder from learning the identity function; I’m not sure whether neural processes suffer from the same problem (I haven’t read about them in detail).
Another aspect is that VAEs are doing something like binary classification, but at the distribution level; I’m not sure whether neural processes tackle the problem from the same perspective.
Finally, I’ve also encountered numerous papers that use the KL term, and FWIW I’m not fully convinced that, empirically at least, KL offers anything beyond what you get with cross entropy.
I’m not sure what I was really asking in the original question. kl_divergence is just an autograd-enabled version of the KL divergence for the given distributions; it does nothing for reparameterization. Reparameterization happens by computing it manually or by using the
rsample() function, which is present on many of the distributions in PyTorch.
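To make the separation concrete, here is a minimal sketch (the shapes and variable names are my own, not from the linked repo) showing that kl_divergence evaluates the closed-form KL between two Gaussians without sampling, while rsample() is what actually applies the reparameterization trick:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical encoder outputs for a batch of 4 latent variables.
mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)

q = Normal(mu, log_sigma.exp())                 # approximate posterior q(z|x)
p = Normal(torch.zeros(4), torch.ones(4))       # prior p(z)

# Closed-form KL between the two Gaussians -- no sampling involved.
kl = kl_divergence(q, p).sum()

# Drawing a sample still needs reparameterization; rsample() computes
# z = mu + sigma * eps with eps ~ N(0, 1), so gradients flow to mu and sigma.
z = q.rsample()

kl.backward()  # gradients reach mu and log_sigma through the analytic KL
```

So the two mechanisms are orthogonal: you need rsample() (or a manual mu + sigma * eps) wherever a sample of z enters the loss, and kl_divergence only replaces deriving the KL term of the ELBO by hand.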
Also, the KL term doesn’t replace CE; it’s just an added regularization term on whatever distribution is being reparameterized, to keep it from drifting too far from the prior.
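In VAE-style code that usually looks something like the following sketch (the function name and beta weight are illustrative, not from any particular implementation), where the KL term is simply added to the reconstruction cross-entropy:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence


def vae_loss(x, x_recon, mu, sigma, beta=1.0):
    # Reconstruction term: cross-entropy between input and reconstruction
    # (BCE here, assuming binary/normalized data).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Regularization term: KL from q(z|x) to the standard normal prior.
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    kl = kl_divergence(Normal(mu, sigma), prior).sum()
    return recon + beta * kl
```

Setting beta=0 recovers a plain autoencoder objective, which is one easy way to test the empirical contribution of the KL term.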
Thanks, I agree that KL doesn’t replace CE; I’m just questioning whether it really contributes empirically to the final model. E.g., have you seen any results/differences between a VAE trained with and without a KL regularisation term? Is the final model with KL regularisation orders of magnitude better than the one without, or is the difference negligible?
When I build a model that requires KL, I usually run it once while testing without the KL term to make sure everything is hooked up and working correctly. What I’ve seen is that without the KL term, the variance of whatever distribution is being reparameterized collapses, so the model looks more deterministic.
It probably depends on the specific data and problem whether this helps or hurts. But my guess is that if the variance does go to zero and the model still improves, then you probably didn’t need a reparameterized model in the first place.
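The collapse behavior falls straight out of the closed-form KL to the standard normal prior, KL(N(mu, sigma) || N(0, 1)) = -log(sigma) + (sigma^2 + mu^2 - 1) / 2. A quick sketch in plain Python (function name is mine):

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma) || N(0, 1) ) for scalar parameters."""
    return -math.log(sigma) + (sigma**2 + mu**2 - 1) / 2


# At the prior (mu=0, sigma=1) the penalty is exactly zero.
# As sigma shrinks toward 0 (a collapsed, near-deterministic posterior),
# the -log(sigma) term diverges, so the KL term pushes the variance back up.
```

So without the KL term nothing stops sigma from shrinking to zero; with it, a collapsing variance incurs an unbounded penalty, which is exactly the drift-from-the-prior regularization described above.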