Compute latent representation of a sentence with a transformer

Hi, I am trying to build a variational autoencoder in PyTorch and would like to use a transformer for both the encoder and the decoder, but I'm not sure how to go about it.
I have already implemented tokenization of my dataset, so each word of a sentence is mapped to an integer.
The input to my encoder would therefore be a tensor of shape (batch_size, sent_length) containing integers between 0 and the number of distinct words in my dataset.
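For concreteness, here is a toy example of what such an input tensor looks like (the vocabulary size and token ids are made up):

```python
import torch

# toy example: a batch of 2 sentences, each 5 tokens long,
# drawn from a hypothetical vocabulary of 1000 distinct words (ids 0..999)
vocab_size = 1000
tokens = torch.tensor([
    [12, 47, 5, 901, 3],
    [7, 88, 450, 2, 0],
])
print(tokens.shape)   # torch.Size([2, 5]) == (batch_size, sent_length)
print(tokens.dtype)   # torch.int64
```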

How can I build a transformer that creates from this input a latent representation of shape (batch_size, latent_dim)? Is there any tutorial for this use-case?

You could use a TransformerEncoder followed by a TransformerDecoder; the output of the TransformerEncoder would then serve as the latent representation. This paper may be useful.
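As a starting point, something like the sketch below maps token ids of shape (batch_size, sent_length) to a latent code of shape (batch_size, latent_dim). All class names and hyperparameters here are illustrative, positional encodings are omitted for brevity, and the mean-pooling plus reparameterization choices are just one common way to set up the VAE posterior, not the only one:

```python
import torch
import torch.nn as nn

class TransformerVAEEncoder(nn.Module):
    """Sketch: token ids (batch, sent_length) -> latent z (batch, latent_dim)."""

    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # two linear heads produce the parameters of the Gaussian posterior
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))  # (batch, sent_length, d_model)
        pooled = h.mean(dim=1)                # mean-pool over tokens -> (batch, d_model)
        mu = self.to_mu(pooled)
        logvar = self.to_logvar(pooled)
        # reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

enc = TransformerVAEEncoder(vocab_size=1000)
tokens = torch.randint(0, 1000, (4, 12))  # batch of 4 sentences, 12 tokens each
z, mu, logvar = enc(tokens)
print(z.shape)  # torch.Size([4, 32]) == (batch_size, latent_dim)
```

For the decoder you would go the other way: expand z back to a per-token memory (or condition on it) and feed a TransformerDecoder that reconstructs the token sequence, training the whole thing with reconstruction loss plus the KL term.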