Using log_prob in distributions

Can someone please verify whether the following use of the log_prob function is correct for calculating the KL divergence?

self.q_y = RelaxedOneHotCategorical(self.temperature, logits=logits)
self.log_q_y = self.q_y.log_prob(samples)
self.p_y = RelaxedOneHotCategorical(self.temperature, logits=prior)
self.log_p_y = self.p_y.log_prob(samples)
kl = torch.exp(self.log_q_y) * (self.log_q_y - self.log_p_y)

Your kl is not the KL divergence between q and p.
If you want to estimate the KL divergence between q and p by sampling, you just need to draw samples from q, compute log q(samples) - log p(samples), and take the average over the number of samples. Note that computing the KL between two RelaxedOneHotCategorical distributions suffers from numerical instability; it can be better to compute the KL between ExpRelaxedOneHotCategorical distributions instead, since the KL is invariant under invertible transformations. For more information you can read https://arxiv.org/pdf/1611.00712.pdf
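
A minimal sketch of that estimator (the shapes, temperature, and logits below are made-up values just for illustration):

import torch
from torch.distributions.relaxed_categorical import ExpRelaxedOneHotCategorical

temperature = torch.tensor(0.5)
logits = torch.randn(32, 10)        # posterior logits, e.g. from an encoder
prior_logits = torch.zeros(32, 10)  # uniform prior over 10 categories

q = ExpRelaxedOneHotCategorical(temperature, logits=logits)
p = ExpRelaxedOneHotCategorical(temperature, logits=prior_logits)

# Reparameterized samples from q (these live in log-simplex space), then the
# Monte Carlo estimate KL(q||p) ≈ mean[log q(z) - log p(z)]
z = q.rsample(torch.Size([10]))                       # shape [10, 32, 10]
kl_estimate = (q.log_prob(z) - p.log_prob(z)).mean()  # averages over samples and batch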

Thanks for the help! So do you mean kl = log q(samples) - log p(samples),
and then take the average over the samples?

Yes. I'd also advise reading the paper to get more information and examples.

Yes, I did read that paper and the Gumbel-Softmax one as well. I was just confused about the way log_prob is used. I see that the KL can be defined in two ways:

  1. Using the parameterized distributions, where the samples are drawn from the q distribution (see the sketch after this list):
kl = log_q_y - log_p_y
  2. Using the logits directly:
postlogprob = log(softmax(logits))
priorlogprob = log(softmax(prior))
kl = (postlogprob * samples) - (priorlogprob * samples)
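
Here is a quick sketch of how I would write case 1 in code (made-up shapes; temperature, logits, and prior stand in for the values from my first snippet):

import torch
from torch.distributions import RelaxedOneHotCategorical

temperature = torch.tensor(1.0)   # illustrative values only
logits = torch.randn(8, 5)        # posterior logits
prior = torch.zeros(8, 5)         # prior logits

q_y = RelaxedOneHotCategorical(temperature, logits=logits)
p_y = RelaxedOneHotCategorical(temperature, logits=prior)

samples = q_y.rsample()               # samples drawn from q
log_q_y = q_y.log_prob(samples)
log_p_y = p_y.log_prob(samples)
kl = (log_q_y - log_p_y).mean()       # Monte Carlo estimate of KL(q||p)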

Am I correct here?

The first case is right; it is just a Monte Carlo estimate of the KL divergence.
The second case is wrong, because it is not an estimator of the KL at all. Usually the logits, or other parameters of the distributions, are used when the KL divergence can be computed analytically. For example, if we have two Gaussian distributions with means mu_1, mu_2 and variances sigma_1, sigma_2, we can either sample from them and estimate the KL as in case 1, or compute it analytically and get a formula in terms of those parameters, as here: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
In the case of the RelaxedOneHotCategorical distribution you cannot compute the KL analytically, so you have several possible ways to estimate it. For further information read Appendix C here: https://arxiv.org/pdf/1611.00712.pdf
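
To make the Gaussian example concrete, here is a rough sketch (parameter values are made up) comparing the Monte Carlo estimate of case 1 with PyTorch's built-in closed form:

import torch
from torch.distributions import Normal, kl_divergence

# Two univariate Gaussians with made-up parameters
q = Normal(loc=torch.tensor(1.0), scale=torch.tensor(0.5))
p = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))

# Case 1: Monte Carlo estimate with samples drawn from q
z = q.rsample(torch.Size([100000]))
kl_mc = (q.log_prob(z) - p.log_prob(z)).mean()

# Analytic KL (the formula from the stackexchange link, built into PyTorch)
kl_exact = kl_divergence(q, p)

print(kl_mc.item(), kl_exact.item())  # close, up to sampling noise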

I don’t understand why the second method is wrong. I am actually trying to maximize the ELBO, which can be written as E[log p(x|z) + log p(z) - log q(z|x)] = E[log p(x|z)] - KL(q||p). This KL divergence can be written as the difference of the cross-entropy H(q,p) and the entropy H(q), i.e. KL(q||p) = E_q[log q(z) - log p(z)] = H(q,p) - H(q). So we can write H(q) as -postlogprob * samples and H(q,p) as -priorlogprob * samples, where the samples are drawn from the q distribution.
What’s wrong with this formulation?

I know we can compute the KL analytically using a closed-form formula for univariate Gaussians.
What I’m actually confused about is how the KL divergence computed in this notebook is correct:
https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb (Cell 5)

Can someone please clarify this?

Did you figure out a solution in the end?

For a normal likelihood and a unit normal prior, I’m using the code below, but it’s not working. Is there something wrong with it?

        z_mu, z_var = self.enc(x)
        
        # ---------
        # sample Z
        # ---------
        # init likelihood and prior
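        # (assuming z_var is the log-variance predicted by the encoder, hence std = exp(z_var / 2))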
        std = torch.exp(z_var / 2)

        # Normal likelihood
        Q = torch.distributions.normal.Normal(z_mu, std)

        # Normal(0, 1) prior
        P = torch.distributions.normal.Normal(loc=torch.zeros_like(z_mu), scale=torch.ones_like(std))

        # sample Z
        z = Q.rsample()

        # KL div
        qz = Q.log_prob(z)
        pz = P.log_prob(z)

        kl_loss = torch.mean(qz - pz)

I am doing something similar with an exponential distribution. For some reason every element in my batch gets the same scale and shift from the encoder. :thinking: Am I getting i.i.d. samples across each dimension if I sample like this? Note z.shape = [kl_sample_count, batch_size, feature_size]:

prior = torch.distributions.normal.Normal(torch.zeros((batch_size, feature_size)), torch.ones((batch_size, feature_size)))
z = prior.rsample(torch.Size([kl_sample_count]))

EDIT: I have narrowed the cause down to the hidden layer being all zeros after the ReLU, which makes every output equal the bias of the final layer. But I’m still not sure what’s causing that…