I am unsure how to correctly implement the KLD loss of my Variational Autoencoder.

I work with input tensors of shape (batch_size, 256, 768). At the bottleneck of the VAE the tensors have shape (batch_size, 32, 8), which is flattened to (batch_size, 256) for the FC layers that compute mu and log_var.

The input tensors are zero-padded at the end because the sequences have different lengths.

My reconstruction loss is a cosine similarity loss calculated between the input tensors and the reconstructed tensors, defined as:

```
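import torch
import torch.nn as nn

def masked_cosine_similarity_loss(output: torch.Tensor, target: torch.Tensor, masks: torch.Tensor):
    # masks is 3-D; keep one mask value per sequence position
    masks = masks[:, :, 0]
    cos = nn.CosineSimilarity(dim=2)
    # cosine distance per position, shape (batch_size, seq_len)
    cos_loss = 1.0 - cos(output, target)
    # zero out padded positions, then average over the real positions only
    cos_loss = cos_loss * masks
    cos_loss = torch.sum(cos_loss) / torch.sum(masks)
    return cos_loss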
```
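
To make the padding concern concrete, here is a minimal sketch that checks whether the padded positions leak into the loss. The random tensors, the sequence lengths (100 and 256) and the mask layout (1 for real positions, 0 for padding, same shape as the input) are assumptions for illustration, not my real data. As far as I understand, `nn.CosineSimilarity` clamps the norms with `eps`, so an all-zero padded row should give a similarity of 0 rather than NaN.

```python
import torch
import torch.nn as nn


def masked_cosine_similarity_loss(output, target, masks):
    # Same computation as above, restated so this snippet runs on its own.
    token_mask = masks[:, :, 0]
    cos = nn.CosineSimilarity(dim=2)
    loss = (1.0 - cos(output, target)) * token_mask
    return loss.sum() / token_mask.sum()


torch.manual_seed(0)
batch_size, seq_len, feat = 2, 256, 768
lengths = [100, 256]  # hypothetical true sequence lengths

# Assumed mask layout: 1.0 on real positions, 0.0 on padding.
masks = torch.zeros(batch_size, seq_len, feat)
for i, n in enumerate(lengths):
    masks[i, :n, :] = 1.0

target = torch.randn(batch_size, seq_len, feat) * masks  # zero-padded target
output = torch.randn(batch_size, seq_len, feat)
loss_a = masked_cosine_similarity_loss(output, target, masks)

# Change the reconstruction ONLY at the padded positions of sample 0.
output_b = output.clone()
output_b[0, lengths[0]:, :] = torch.randn(seq_len - lengths[0], feat)
loss_b = masked_cosine_similarity_loss(output_b, target, masks)

# If the masking works, both losses should match.
print(loss_a.item(), loss_b.item())
```

In this sketch the two values should agree, which would suggest that the padded positions do not influence the loss value even though the cosine similarity is computed on them first.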

My KLD loss is calculated on the latent vectors of shape (batch_size, 256) and is defined as:

```
KLD_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```
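
To make the reshaping question concrete, here is a minimal sketch comparing a few ways of reducing the same element-wise KL term. The random `mu` and `log_var` and the batch size of 4 are stand-ins (assumptions), not outputs of my model. Since the formula is a plain sum over latent elements, I would expect reshaping to (batch, 32, 8) to leave the total unchanged, with only the normalization differing.

```python
import torch

torch.manual_seed(0)
batch_size, latent_dim = 4, 256  # assumed sizes for illustration
mu = torch.randn(batch_size, latent_dim)
log_var = torch.randn(batch_size, latent_dim)

# Element-wise KL term between N(mu, exp(log_var)) and N(0, 1).
kld_elem = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp())

# 1) Sum over everything (what the one-liner above computes).
kld_total = kld_elem.sum()

# 2) Sum over the latent dimension, then average over the batch.
kld_per_sample = kld_elem.sum(dim=1).mean()

# 3) Reshape to (batch, 32, 8) first, sum over both latent axes, average over batch.
kld_reshaped = kld_elem.view(batch_size, 32, 8).sum(dim=(1, 2)).mean()

print(kld_total.item(), kld_per_sample.item(), kld_reshaped.item())
```

Here variants 2 and 3 should be equal, and variant 1 should be `batch_size` times larger, so the choice seems to come down to how the KLD is scaled relative to the reconstruction loss, which is already averaged over the unmasked positions.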

Are the losses calculated correctly? Given that the sequence dimension is 32 and the feature dimension is 8 in the latent space, should I reshape the mu and log_var tensors from (batch, 256) to (batch, 32, 8) before calculating the KLD? If so, should it be calculated along a specific dimension and normalized somehow?

Also, the cosine similarity is calculated on the padded tensors. Even though I multiply by the mask right afterwards, I suspect this is not the best way to compute it, but I didn't find anything online that handles this.