Latent representation in a transformer

I am using a caption-generating transformer, and my goal is to train a regression model on top of it. For this I need access to the latent (vector) representation of the predicted text. My plan is to extract that representation while training the transformer, store it, and later train the regression model on the stored vectors. However, I cannot figure out how to extract the vector representation of the final predicted sequence. Can you please help me with that?

You can either train an nn.Embedding layer or download a pretrained embedding layer, and use it to map the predicted tokens to vectors.
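As a rough illustration of the embedding-layer route, here is a minimal PyTorch sketch. It assumes the predicted caption is available as a tensor of token ids, right-padded with a padding id. The names `VOCAB_SIZE`, `EMBED_DIM`, `PAD_ID`, `caption_vectors`, and `regressor` are hypothetical, and mean pooling is just one simple way to collapse a sequence into a single vector, not necessarily the best one for your task.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE = 1000
EMBED_DIM = 64
PAD_ID = 0

# A trainable embedding layer; the row for PAD_ID is kept at zero.
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=PAD_ID)


def caption_vectors(token_ids: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings into one vector per caption, ignoring padding."""
    mask = (token_ids != PAD_ID).unsqueeze(-1).float()  # (batch, seq_len, 1)
    embedded = embedding(token_ids)                     # (batch, seq_len, EMBED_DIM)
    summed = (embedded * mask).sum(dim=1)               # (batch, EMBED_DIM)
    counts = mask.sum(dim=1).clamp(min=1.0)             # avoid division by zero
    return summed / counts


# Two predicted captions as token ids, right-padded with PAD_ID.
predicted_ids = torch.tensor([[5, 42, 7, 0, 0],
                              [9, 3, 18, 27, 4]])

# Detach before storing so the saved vectors do not keep the autograd graph.
features = caption_vectors(predicted_ids).detach()

# Hypothetical regression head trained later on the stored vectors.
regressor = nn.Linear(EMBED_DIM, 1)
prediction = regressor(features)
```

If you go with a pretrained embedding instead, `nn.Embedding.from_pretrained(weights)` builds the layer from an existing weight matrix of shape (vocabulary size, embedding dimension). The token ids must then match the vocabulary those weights were trained with.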

