How to implement VAE loss (Cosine similarity and KLD)

I am unsure how to correctly implement the KLD loss of my variational autoencoder.

I work with input tensors of shape (batch_size, 256, 768). At the bottleneck of the VAE the tensors have shape (batch_size, 32, 8); they are flattened to (batch_size, 256) by the FC layers that compute mu and log_var.
The ends of the input tensors are padded with zeros because the input sequences have different lengths.
My reconstruction loss is a cosine similarity loss between the input tensors and the reconstructed tensors, defined as:

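import torch
import torch.nn as nn

def masked_cosine_similarity_loss(output: torch.Tensor, target: torch.Tensor, masks: torch.Tensor):
    # masks is 3-D like the inputs; keep one value per position -> (batch_size, seq_len)
    masks = masks[:, :, 0]
    cos = nn.CosineSimilarity(dim=2)
    # cosine distance per sequence position, shape (batch_size, seq_len)
    cos_loss = cos(output, target)
    cos_loss = 1.0 - cos_loss
    # zero out the padded positions
    cos_loss *= masks
    # average over the unpadded positions only
    cos_loss = torch.sum(cos_loss) / torch.sum(masks)
    return cos_loss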

My KLD loss is calculated on the latent mu and log_var tensors of shape (batch_size, 256), defined as:

KLD_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

Are the losses calculated correctly? Given that the sequence dimension is 32 and the feature dimension is 8 in the latent space, should I reshape the mu and log_var tensors from (batch_size, 256) to (batch_size, 32, 8) before calculating the KLD? And if so, should it be calculated over a specific dimension and normalized somehow?
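To make the reshape question concrete, here is a minimal sketch, assuming a diagonal Gaussian posterior and a standard normal prior (which is what the formula above corresponds to). In that case the KLD is a sum of independent per-element terms, so summing over (256,) or over (32, 8) should give the same number. The helper name kld_per_sample and the random tensors are only for illustration. Averaging over the batch is one common convention, not the only one; how to weight it against the cosine term remains a separate choice.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, feat_dim = 4, 32, 8  # latent shape from the question

# Stand-ins for the encoder outputs (random values, for illustration only)
mu = torch.randn(batch_size, seq_len * feat_dim)
log_var = torch.randn(batch_size, seq_len * feat_dim)

def kld_per_sample(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """KL(N(mu, sigma^2) || N(0, 1)), summed over every non-batch dimension."""
    kld = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp())
    return kld.flatten(start_dim=1).sum(dim=1)  # shape: (batch_size,)

# Flattened latent, shape (batch_size, 256)
kld_flat = kld_per_sample(mu, log_var)

# Same values viewed as (batch_size, 32, 8)
kld_reshaped = kld_per_sample(
    mu.view(batch_size, seq_len, feat_dim),
    log_var.view(batch_size, seq_len, feat_dim),
)

# The formula from the question: summed over the batch as well
kld_sum = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Batch-averaged version, which does not grow with batch_size
kld_mean = kld_flat.mean()

print(torch.allclose(kld_flat, kld_reshaped))
print(torch.allclose(kld_sum / batch_size, kld_mean))
```

The only difference between kld_sum and kld_mean is the division by batch_size, so the summed version scales with the batch size while the cosine loss (a mean over unpadded positions) does not.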
Also, the cosine similarity is calculated on the padded tensors. Even though I multiply by the mask right afterwards, I suspect this is not the best way to do it, but I could not find anything online that handles this differently.
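One possible alternative, sketched here under the assumption that masks has the same shape as the inputs and contains 1.0 at real positions and 0.0 at padding, is to select the unpadded positions with a boolean index before computing the cosine similarity. The function names, the lengths tensor, and the random data below are hypothetical; the second function just restates the mask-multiply approach so the two can be compared.

```python
import torch
import torch.nn.functional as F

def masked_cosine_loss_indexed(output, target, masks):
    """Cosine distance computed only on the unpadded positions."""
    valid = masks[:, :, 0].bool()  # (batch_size, seq_len)
    # Boolean indexing keeps the valid rows only -> (n_valid, features)
    cos = F.cosine_similarity(output[valid], target[valid], dim=-1)
    return (1.0 - cos).mean()

def masked_cosine_loss_multiply(output, target, masks):
    """Mask-multiply approach: compute everywhere, then zero the padding."""
    m = masks[:, :, 0]
    cos = F.cosine_similarity(output, target, dim=2)
    return ((1.0 - cos) * m).sum() / m.sum()

torch.manual_seed(0)
batch_size, seq_len, feat = 3, 256, 768

# Hypothetical true sequence lengths, used to build the padding mask
lengths = torch.tensor([256, 100, 17])
positions = torch.arange(seq_len).unsqueeze(0)      # (1, seq_len)
valid = positions < lengths.unsqueeze(1)            # (batch_size, seq_len)
masks = valid.unsqueeze(-1).expand(-1, -1, feat).float()

target = torch.randn(batch_size, seq_len, feat) * masks  # zero-padded input
output = torch.randn(batch_size, seq_len, feat)          # stand-in reconstruction

loss_indexed = masked_cosine_loss_indexed(output, target, masks)
loss_multiply = masked_cosine_loss_multiply(output, target, masks)
print(loss_indexed.item(), loss_multiply.item())
```

Both versions should return the same value up to floating-point error; the indexed one simply never evaluates the cosine similarity on the all-zero padded rows.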

Did you get it to work?