Why does KL vanishing occur?

Hi, I am trying to implement the CVAE-based dialog model described in Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders.

I am confused about how the way the reconstruction error is reduced (mean vs. sum) affects the KL divergence.

When I use the nll_loss function with its default mean reduction to obtain the average reconstruction error, the KL divergence gradually decays to 0, which is the KL vanishing problem. The code is as below.

avg_recon_loss = F.nll_loss(outputs.contiguous().view(-1, outputs.size(-1)), targets.reshape(-1), ignore_index=self.pad_id)

The value of avg_recon_loss is between 4 and 2.

However, when I instead borrow the method from NeuralDialog-CVAE-pytorch, which sums the reconstruction error over time steps before averaging over the batch, the KL settles at a non-zero lower bound, e.g., around 28.
The code is as below:

# Per-token NLL with no reduction, reshaped back to (batch, seq_len)
reconstruction_loss = F.nll_loss(outputs.contiguous().view(-1, outputs.size(-1)), targets.reshape(-1), reduction='none')
reconstruction_loss = reconstruction_loss.view(outputs.size()[:-1])
# Mask out padding tokens, sum over time steps, then average over the batch
label_mask = (targets != self.pad_id).float()
reconstruction_loss = torch.sum(reconstruction_loss * label_mask, 1)
avg_recon_loss = torch.mean(reconstruction_loss)

The value of avg_recon_loss is between 77 and 20.
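To make the scale difference between the two variants concrete, here is a minimal, self-contained sketch (with made-up dimensions and random tensors, and no padding for simplicity; `pad_id`, `batch`, `seq_len`, and `vocab` are illustrative names, not from my model). With no padded positions, the sum-then-batch-mean loss is exactly `seq_len` times the token-level mean, so under the mean reduction the KL term carries roughly `seq_len` times more relative weight in the ELBO, which may explain why it gets pushed to 0:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pad_id = 0
batch, seq_len, vocab = 4, 10, 50

# Fake decoder log-probabilities and targets (no padding tokens here)
outputs = F.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)
targets = torch.randint(1, vocab, (batch, seq_len))

# Variant 1: mean over all tokens (nll_loss default reduction)
mean_loss = F.nll_loss(outputs.view(-1, vocab), targets.view(-1),
                       ignore_index=pad_id)

# Variant 2: per-token NLL, summed over time steps, averaged over the batch
per_token = F.nll_loss(outputs.view(-1, vocab), targets.view(-1),
                       reduction='none').view(batch, seq_len)
label_mask = (targets != pad_id).float()
sum_loss = (per_token * label_mask).sum(dim=1).mean()

# Without padding, sum_loss == seq_len * mean_loss
print(mean_loss.item(), sum_loss.item(), (sum_loss / mean_loss).item())
```

The ratio printed at the end equals `seq_len`, so the reconstruction term in variant 2 dominates the (unscaled) KL term far more strongly than in variant 1.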