Your kl is not a KL divergence between q and p.
If you want to estimate the KL divergence between q and p by sampling, you just need to draw samples from q and then average log q(sample) - log p(sample) over the samples. But computing the KL between two relaxed one-hots suffers from numerical instability; it can be better to compute the KL between ExpRelaxedOneHotCategorical distributions instead, since KL is invariant under invertible transformations. For more information you can read https://arxiv.org/pdf/1611.00712.pdf

Yes, I did read that paper, and the Gumbel-Softmax one as well. I was just confused about the way log_prob is used. I see that the KL can be estimated in two ways:

Using the parameterized distribution where the samples are sampled from the q distribution.

The first case is right: it is just a Monte Carlo estimate of the KL divergence.
The second case is wrong, because it is not an estimate of the KL; it computes something else entirely. Logits, or any other parameters of the distributions, are normally used to compute the KL divergence analytically. For example, if we have two Gaussian distributions with means mu_1, mu_2 and standard deviations sigma_1, sigma_2, we can either sample from them and compute the KL as in case 1, or compute it analytically and get a closed-form expression in those parameters, as here: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
In the case of the RelaxedOneHotCategorical distribution you cannot compute the KL analytically, so you have several possible ways to estimate it. For further information, read Appendix C here: https://arxiv.org/pdf/1611.00712.pdf
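For the Gaussian example, both routes are easy to check against each other in PyTorch: the means and scales below are arbitrary illustration values.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
q = Normal(loc=torch.tensor(1.0), scale=torch.tensor(0.5))   # mu_1, sigma_1
p = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))   # mu_2, sigma_2

# Case 1: Monte Carlo estimate with samples drawn from q.
z = q.sample(torch.Size([100000]))
kl_mc = (q.log_prob(z) - p.log_prob(z)).mean()

# Analytic value, available for Gaussians via torch.distributions.kl_divergence
# (the same closed form as in the linked stackexchange answer).
kl_exact = kl_divergence(q, p)
print(kl_mc.item(), kl_exact.item())   # the two agree up to Monte Carlo noise
```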

I don’t understand why the second method is wrong. I am actually trying to maximize the ELBO, which can be written as E[log p(x|z) + log p(z) - log q(z|x)] = E[log p(x|z)] - KL(q||p). Now this KL divergence can be written as the difference between the cross-entropy H(q, p) and the entropy H(q). So H(q) can be estimated as the average of -log q(z) over the samples, and H(q, p) as the average of -log p(z), where the samples z are drawn from the q distribution.
What’s wrong with this formulation?
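A small check of that decomposition, using Gaussian stand-ins for q(z|x) and p(z) (the parameters are made up): as long as both expectations are taken over the same samples from q, the cross-entropy-minus-entropy form is algebraically identical to the direct Monte Carlo KL estimate.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
q = Normal(torch.tensor(0.5), torch.tensor(0.8))   # stand-in for q(z|x)
p = Normal(torch.tensor(0.0), torch.tensor(1.0))   # stand-in for the prior p(z)

z = q.sample(torch.Size([100000]))                 # samples from q

# Direct Monte Carlo estimate of KL(q || p).
kl_direct = (q.log_prob(z) - p.log_prob(z)).mean()

# Same quantity via KL(q||p) = H(q, p) - H(q), both expectations under q.
cross_entropy = -p.log_prob(z).mean()              # estimate of H(q, p)
entropy = -q.log_prob(z).mean()                    # estimate of H(q)
kl_via_entropies = cross_entropy - entropy
print(kl_direct.item(), kl_via_entropies.item())   # identical up to rounding
```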

I am doing something similar with an exponential distribution. For some reason every element in my batch gets the same scale and shift from the encoder. Am I getting i.i.d. samples across each dimension if I sample like this? Note z.shape = [kl_sample_count, batch_size, feature_size]:

import torch

prior = torch.distributions.Normal(
    torch.zeros(batch_size, feature_size),
    torch.ones(batch_size, feature_size),
)
z = prior.rsample(torch.Size([kl_sample_count]))
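To answer the i.i.d. question, a quick self-contained check (with made-up sizes): a factorized Normal has no cross-dimension structure, so rsample draws every element independently across all three axes.

```python
import torch

torch.manual_seed(0)
batch_size, feature_size, kl_sample_count = 8, 16, 5

prior = torch.distributions.Normal(
    torch.zeros(batch_size, feature_size),
    torch.ones(batch_size, feature_size),
)
z = prior.rsample(torch.Size([kl_sample_count]))

print(z.shape)                         # torch.Size([5, 8, 16])
print(torch.equal(z[0], z[1]))         # False: each sample slice is distinct
```

So identical rows in z point at the encoder outputs, not at the sampling itself.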

EDIT: I have tracked the cause down to the hidden layer being all zeros after the ReLU, so every output equals the bias of the final layer. But I’m still not sure what’s causing this…
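One way to diagnose that is to measure how many hidden units the ReLU zeros out; the toy two-layer encoder below is purely hypothetical, just to show the check. A fraction of 1.0 is the classic "dead ReLU" symptom, often caused by a too-large learning rate or unnormalized inputs driving all pre-activations negative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical encoder layers, only to illustrate the diagnostic.
hidden = nn.Linear(10, 32)
out = nn.Linear(32, 4)
x = torch.randn(8, 10)

h = torch.relu(hidden(x))
dead_fraction = (h == 0).float().mean()
print(dead_fraction.item())            # ~0.5 is normal at init
if dead_fraction == 1.0:
    # Every row of out(h) collapses to out.bias, the symptom described above.
    print(out(h))
```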