Using log_prob in distributions

backpropper · June 16, 2018, 6:59am

Can someone please verify if the following use of log_prob fucntion is correct to calculate KL Divergence?

self.q_y = RelaxedOneHotCategorical(self.temperature, logits=logits)
self.log_q_y = self.q_y.log_prob(samples)
self.p_y = RelaxedOneHotCategorical(self.temperature, logits=prior)
self.log_p_y = self.p_y.log_prob(samples)
kl = torch.exp(self.log_q_y) * (self.log_q_y - self.log_p_y)

agadetsky · June 17, 2018, 12:35am

Your kl is not a KL divergence between q and p.
If you want to estimate KL divergence between q and p using sampling, then you just need to get samples from q and then calculate log q(sample) - log p(samples) and take average over number of samples. But computing KL between two relaxed onehots suffers from numerical instability, it can be better to compute KL between ExpRelaxedOneHotCategorical as KL is invariant under invertible transformations. For more information you can read https://arxiv.org/pdf/1611.00712.pdf

backpropper · June 17, 2018, 1:21am

Thanks for the help! So do you mean kl = log q(samples) - log p(samples)?
and then take average over the samples?

agadetsky · June 17, 2018, 1:34am

Yes, also I advice to read the paper to get more information and examples

backpropper · June 19, 2018, 12:49am

Yes I did read that paper and the Gumbel Softmax one as well. I was just confused about the way log_prob is used. I see that the KL can be defined in two ways:

Using the parameterized distribution where the samples are sampled from the q distribution.

kl = log_q_y - log_p_y

Using the logits directly

postlogprob = log(softmax(logits))
priorlogprob = log(softmax(prior))
kl= (poslogprob * samples) - (priorlogprob * samples)

Am I correct here?

agadetsky · June 19, 2018, 6:17am

The first case is right, it is just Monte Carlo estimation of the KL divergence.
The second case is false, because it is not an estimation, it is something weird. Usually logits or any parameters of the distributions can be used to compute KL divergence analytically. For example, if we have two Gaussian distributions with means mu_1 and mu_2 and variance sigma_1, sigma_2, then we can sample from them and compute KL using case 1. Or we can compute analytically and get formula dependent on these parameters of the Gaussians as here https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
In the case of RelaxedOneHotCategorical distribution you can not compute KL analytically, so you have several possible solution to estimate this KL. For further information read Appendix C here: https://arxiv.org/pdf/1611.00712.pdf

backpropper · June 19, 2018, 1:41pm

I don’t understand why the second method is wrong. I am actually trying to maximize ELBO which can be written as E[log p(x|z) + log p(z) - log q(z|x)] = E[log p(x|z)] - KL(q||p). Now this KL divergence can be written as the difference of the cross-entropy H(q,p) and the entropy H(q). So we can write H(q) as -postlogprob * samples and H(q,p) as -priorlogprob * samples where samples are sampled from the q distribution.
What’s wrong with this formulation?

I know we can compute KL analytically using a closed form formula for Univariate Gaussians.
I am actually confused as to how the KL divergence computed in this link is correct:
https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb (Cell 5)

backpropper · June 21, 2018, 4:51pm

Can someone please clarify this?

Diego999 · April 3, 2019, 6:14am

Did you figure a solution out finally ?

williamFalcon · February 2, 2020, 1:09pm

For normal likelihood and unit normal prior, i’m using the code below but it’s not working. Is there something wrong?

        z_mu, z_var = self.enc(x)
        
        # ---------
        # sample Z
        # ---------
        # init likelihood and prior
        std = torch.exp(z_var / 2)

        # Normal likelihood
        Q = torch.distributions.normal.Normal(z_mu, std)

        # Normal(0, 1) prior
        P = torch.distributions.normal.Normal(loc=torch.zeros_like(z_mu), scale=torch.ones_like(std))

        # sample Z
        z = Q.rsample()

        # KL div
        qz = Q.log_prob(z)
        pz = P.log_prob(z)

        kl_loss = torch.mean(qz - pz)

kfallah · June 16, 2020, 6:38pm

I am doing something similar with an exponential distribution. For some reason every element in my batch has the same scale and shift output by the encoder. Am I getting i.i.d. samples across each dimension if I sample like this? Note z.shape = [ kl_sample_count, batch_size, feature_size]:

prior = torch.distributions.normal.Normal(torch.zeros((batch_size, feature_size)), torch.ones((batch_size, feature_size)))
z = prior.rsample(torch.Size([kl_sample_count]))

EDIT: I have singled out the cause as the hidden layer equaling all zeros after the ReLU, causing every output to equal the bias of the final layer. But still not sure what’s causing this…