Make nn.Transformer work for Text Generation

Hi. I am trying to make a Transformer work for paraphrase generation, but the generations are not useful (the same output every time, full of BOS tokens or "?" tokens).
I followed this tutorial for reference. My implementation is embedded in a framework that requires an Encoder and a Decoder.

The encoder is like this:

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        vocab_size,
        pad_token_id=None,
        embedding_size=256,
        num_heads=8,
        num_layers=3,
        ffnn_size=512,
        dropout=0.1,
    ):
        super(TransformerEncoder, self).__init__()
        self.vocab_size = vocab_size
        self.pad_token_id = pad_token_id

        self.embedding_size = embedding_size
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.ffnn_size = ffnn_size

        self.embed_tokens = TokenEmbedding(vocab_size, embedding_size)
        self.embed_positions = PositionalEmbedding(embedding_size, dropout=dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            embedding_size,
            num_heads,
            ffnn_size,
            dropout,
        )
        encoder_norm = nn.LayerNorm(embedding_size)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, encoder_norm)

    def forward(
        self,
        input_ids,
    ):

        embedded_tokens = self.embed_positions(self.embed_tokens(input_ids))
        # B x T x C -> T x B x C
        embedded_tokens = embedded_tokens.transpose(0, 1)

        memory = self.encoder(embedded_tokens)

        return (memory,)

The decoder is like this:

class TransformerDecoder(nn.Module):
    def __init__(
        self,
        vocab_size,
        pad_token_id=None,
        embedding_size=256,
        num_heads=8,
        num_layers=3,
        ffnn_size=512,
        dropout=0.1,
    ):

        super(TransformerDecoder, self).__init__()
        self.vocab_size = vocab_size
        self.pad_token_id = pad_token_id

        self.embedding_size = embedding_size
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.ffnn_size = ffnn_size

        self.dropout_module = nn.Dropout(p=dropout)

        self.embed_tokens = TokenEmbedding(vocab_size, embedding_size)
        self.embed_positions = PositionalEmbedding(embedding_size, dropout=dropout)

        decoder_layer = nn.TransformerDecoderLayer(
            embedding_size, num_heads, ffnn_size, dropout
        )
        decoder_norm = nn.LayerNorm(embedding_size)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers, decoder_norm)
        self.fc_out = nn.Linear(embedding_size, vocab_size)

    def forward(
        self,
        input_ids,
        encoder_out,
    ):
        seq_len = input_ids.shape[1]

        device = next(self.parameters()).device
        mask = generate_square_subsequent_mask(seq_len).to(device)

        embedded_tokens = self.embed_positions(self.embed_tokens(input_ids))

        # B x T x C -> T x B x C
        embedded_tokens = embedded_tokens.transpose(0, 1)

        output = self.decoder(embedded_tokens, encoder_out[0], tgt_mask=mask)

        # T x B x C -> B x T x C
        output = output.transpose(1, 0)

        return (self.fc_out(output),)

TokenEmbedding and PositionalEmbedding are as in the tutorial.
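For completeness, here is roughly what those helpers look like in my code (essentially the tutorial versions, except my positional encoding is batch-first because I apply it before the transpose), along with `generate_square_subsequent_mask`, which is just the standard causal mask:

```python
import math

import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """Embedding lookup scaled by sqrt(embedding_size), as in the tutorial."""

    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.embedding_size = embedding_size

    def forward(self, tokens):
        return self.embedding(tokens.long()) * math.sqrt(self.embedding_size)


class PositionalEmbedding(nn.Module):
    """Sinusoidal positional encoding with dropout, batch-first (B x T x C)."""

    def __init__(self, embedding_size, dropout=0.1, max_len=5000):
        super().__init__()
        den = torch.exp(
            -torch.arange(0, embedding_size, 2) * math.log(10000.0) / embedding_size
        )
        pos = torch.arange(0, max_len).reshape(max_len, 1)
        pe = torch.zeros(max_len, embedding_size)
        pe[:, 0::2] = torch.sin(pos * den)
        pe[:, 1::2] = torch.cos(pos * den)
        self.dropout = nn.Dropout(p=dropout)
        self.register_buffer("pe", pe.unsqueeze(0))  # 1 x max_len x C

    def forward(self, embedded):  # embedded: B x T x C
        return self.dropout(embedded + self.pe[:, : embedded.size(1), :])


def generate_square_subsequent_mask(size):
    """Causal mask: -inf above the diagonal, 0.0 on and below it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
```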
The main model just invokes the encoder and decoder like this:

        encoder_outputs = self.encoder(input_ids=input_ids, **kwargs)

        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            encoder_out=encoder_outputs,
            **kwargs,
        )
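At generation time I use a plain greedy loop, along these lines (a simplified sketch, not my exact framework code; `encoder` and `decoder` stand in for the modules above, called as plain functions):

```python
import torch


def greedy_decode(encoder, decoder, input_ids, bos_id, eos_id, max_len=50):
    """Greedy decoding sketch: re-feed the growing prefix into the decoder
    and append the argmax token at each step."""
    encoder_out = encoder(input_ids=input_ids)

    # Start every sequence in the batch with the BOS token.
    generated = torch.full((input_ids.size(0), 1), bos_id, dtype=torch.long)

    for _ in range(max_len - 1):
        logits = decoder(input_ids=generated, encoder_out=encoder_out)[0]
        # Only the prediction at the last position matters for the next token.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if (next_token == eos_id).all():
            break
    return generated
```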

The loss is going down, but the generations are really bad. Here is an example:
Source: < s > Can I jailbreak iOS 10 ? < /s > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad >
Preds: < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s > < s >
Target: < s > Can you jailbreak iOS 10 ? < /s > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad > < pad >

As you can see, the predictions in this case are only BOS tokens. The decoder output is almost identical at every decoding step and across iterations; the model does not seem to be learning. I have tried learning rates from 0.1 down to 1e-4.
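For context, the training loss is plain token-level cross-entropy over the vocabulary, roughly like this (simplified; whether pad tokens are actually ignored in my framework's default setup is the part I am least sure of, so I am including it in case it is relevant):

```python
import torch
import torch.nn.functional as F


def compute_loss(logits, target_ids, pad_token_id):
    """Cross-entropy over the vocabulary; positions holding the pad token
    are excluded from the loss via ignore_index."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T) x vocab
        target_ids.reshape(-1),               # (B*T)
        ignore_index=pad_token_id,
    )
```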

Do you have any intuition about what might be wrong? Sorry the question is not fully self-contained. Thanks in advance for any help you can provide.