Transformer model to encode the set of sentences

Here is the code that attempts to learn an encoding for paragraphs (sets of sentences that are themselves already encoded).

import torch
import torch.nn as nn

class TransformerEnc(nn.Module):
    def __init__(self):
        super(TransformerEnc, self).__init__()
        d_model = 1024
        # PositionalEncoding is the standard sinusoidal module from the PyTorch tutorial
        self.posenc = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8)
        self.model = nn.TransformerEncoder(encoder_layer, num_layers=6)

    def forward(self, x):
        x = self.posenc(x)
        output = self.model(x)
        output = torch.sum(output, 1)  # pool by summing over dim 1
        return output

The forward function receives a tensor x of shape [64, 20, 1024], where 64 is the batch size, 20 is the maximum number of sentences in a paragraph, and 1024 is the dimension of each sentence embedding.
I need an output of shape [64, 1024], i.e. the 20 sentence embeddings pooled into a single 1024-dimensional vector.
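For reference, here is a minimal standalone shape check (using nn.TransformerEncoder directly, without the positional-encoding module) confirming that the output does come out as [64, 1024], so the problem is not a raw shape mismatch. Note that by default (batch_first=False) nn.TransformerEncoderLayer interprets its input as [seq_len, batch, d_model]:

```python
import torch
import torch.nn as nn

# By default, nn.TransformerEncoderLayer expects [seq_len, batch, d_model],
# so a [64, 20, 1024] tensor is read as a 64-step sequence over a batch of 20.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=8)
enc = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(64, 20, 1024)
with torch.no_grad():
    out = enc(x)                 # shape [64, 20, 1024]
    pooled = torch.sum(out, 1)   # shape [64, 1024]
print(pooled.shape)
```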

When I train a model on top of this embedding, I get chance-level accuracy. If I replace this module with an LSTM, it trains well, so clearly something is wrong here.
Any ideas what the problem might be?