TransformerEncoder and TransformerDecoder based text encoder-decoder

I want to build a text encoder-decoder based on nn.TransformerEncoder and nn.TransformerDecoder.
Given a sentence to the encoder, I want to get a vector representation of size 270, and in the decoder I want to reconstruct the given text from that vector representation.

import torch
import torch.nn as nn

class TransformerEncoder(nn.Module):
  def __init__(self, word_count, input_size, num_head, hidden_size, num_layers):
    super(TransformerEncoder, self).__init__()
    # word_count: vocabulary size; input_size: embedding / model dimension (d_model)
    self.embd = nn.Embedding(word_count, input_size)
    encoder_layer = nn.TransformerEncoderLayer(input_size, num_head, hidden_size)
    self.transformer_enc = nn.TransformerEncoder(encoder_layer, num_layers)
    self.linear1 = nn.Linear(input_size, 270)

  def forward(self, x):
    x = x.long()
    emb = self.embd(x)               # (seq_len, batch, input_size)
    mem = self.transformer_enc(emb)  # (seq_len, batch, input_size)
    out = self.linear1(mem)          # (seq_len, batch, 270)
    return out, mem


class TransformerDecoder(nn.Module):
  def __init__(self, input_size, num_head, output_size, hidden_size, num_layers):
    super(TransformerDecoder, self).__init__()
    decoder_layer = nn.TransformerDecoderLayer(input_size, num_head, hidden_size)
    self.transformer_dec = nn.TransformerDecoder(decoder_layer, num_layers)
    self.linear1 = nn.Linear(input_size, output_size)

  def forward(self, x, mem):
    out = self.transformer_dec(x, mem)  # x: (tgt_len, batch, input_size), mem from the encoder
    out = self.linear1(out)             # (tgt_len, batch, output_size)
    return out
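
For reference, a minimal shape check of how the two modules could be wired together; the hyperparameters and the dummy batch below are made up for illustration, and the default nn.TransformerEncoderLayer/DecoderLayer layout is (seq_len, batch, d_model):

# Hypothetical toy sizes, just to check shapes
word_count, input_size, num_head, hidden_size, num_layers = 1000, 256, 8, 512, 3

enc = TransformerEncoder(word_count, input_size, num_head, hidden_size, num_layers)
dec = TransformerDecoder(input_size, num_head, word_count, hidden_size, num_layers)

src = torch.randint(0, word_count, (20, 4))  # (seq_len=20, batch=4) token ids
out, mem = enc(src)
print(out.shape, mem.shape)                  # torch.Size([20, 4, 270]) torch.Size([20, 4, 256])

tgt = torch.randn(20, 4, input_size)         # already-embedded target, for illustration only
logits = dec(tgt, mem)
print(logits.shape)                          # torch.Size([20, 4, 1000])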

I am wondering whether my implementation is correct, because I have not used positional encoding or masking. Is it mandatory to use positional encoding and masking with a Transformer? Is my implementation correct?

Masking is task-dependent. It’s usually used in autoregressive transformers to speed up training. If you just want to make an encoder-decoder, you don’t need it. However, I doubt you are going to obtain something useful. The trick of autoencoders is that the bottleneck is a compressed representation of the input. In your case, your latent space is going to have as many elements as your input sentence, so it’s very unlikely it needs to compress the information.
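
(If you do later train the decoder autoregressively, as suggested further down in this thread, the causal mask is just an upper-triangular matrix of -inf added to the attention scores; recent PyTorch versions also ship a helper, nn.Transformer.generate_square_subsequent_mask, that builds the same thing. A minimal sketch:

import torch

seq_len = 20
# Causal ("subsequent") mask: -inf strictly above the diagonal, 0 elsewhere,
# so position i can only attend to positions <= i.
tgt_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

# It would then be passed to the decoder stack, e.g.
# out = self.transformer_dec(x, mem, tgt_mask=tgt_mask)
)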

As for the positional encoding: yes, you do need it.

Without positional encoding, from the transformer’s point of view, passing
“I live in the south of Vietnam” is the same as passing “Vietnam in live of I south the” or any other permutation. Basically, the positional encoding is what allows the network to know the order of the sequence.
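
A standard choice is the fixed sinusoidal encoding from “Attention Is All You Need”. A minimal sketch, assuming inputs shaped (seq_len, batch, d_model), which you would apply to emb right after the embedding lookup:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
  def __init__(self, d_model, max_len=5000):
    super(PositionalEncoding, self).__init__()
    # Precompute the sinusoidal table once; it is not a learned parameter.
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    self.register_buffer('pe', pe.unsqueeze(1))  # (max_len, 1, d_model)

  def forward(self, x):
    # x: (seq_len, batch, d_model); add the encoding for the first seq_len positions
    return x + self.pe[:x.size(0)]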

Hi @JuanFMontesinos, thank you for the answer.

“your latent space is going to have as many elements as your input sentence, so it’s very unlikely it needs to compress the information.” How can you infer this?

Is this because of these lines in the encoder?

mem = self.transformer_enc(emb)
out = self.linear1(mem)

If so, how can I get useful representations?

Also, I am passing the encoder’s final memory to the decoder, which I think shouldn’t be the case in an autoencoder. Is there any alternative to this?

Transformers are sequence-to-sequence models.
If your sequence is composed of 100 elements (vectors), your output will be composed of 100 elements as well.
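
Concretely, with the encoder posted above (reusing the toy sizes from the shape check, default (seq_len, batch, d_model) layout):

enc = TransformerEncoder(1000, 256, 8, 512, 3)  # toy sizes, as in the shape check above
src = torch.randint(0, 1000, (100, 1))          # a sentence of 100 tokens, batch size 1
out, mem = enc(src)
print(mem.shape)  # torch.Size([100, 1, 256]): one 256-dim vector per input token,
                  # not a single fixed-size sentence vector
print(out.shape)  # torch.Size([100, 1, 270]): the Linear changes the feature size, not the length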

I’m not super familiar with NLP. One option is to look at word2vec-style embedding generators, which produce meaningful vectors lying on a manifold for words. Maybe you can extend that to sentences and benefit from the pretrained weights.

Another option is to apply temporal pooling to “mem”, or some sort of postprocessing that reduces the dimensionality until you have a bottleneck, and then train an autoregressive decoder to infer the sentence from there.
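
For example, a simple mean pooling over the time dimension turns mem into a single fixed-size vector per sentence, which can then be projected down to the desired bottleneck size. A sketch with made-up sizes; the 270-dim projection mirrors the linear1 in the original code:

import torch
import torch.nn as nn

# mem: (seq_len, batch, input_size) coming out of the encoder stack
seq_len, batch, input_size = 100, 4, 256
mem = torch.randn(seq_len, batch, input_size)

pooled = mem.mean(dim=0)                  # (batch, input_size): one vector per sentence
to_bottleneck = nn.Linear(input_size, 270)
z = to_bottleneck(pooled)                 # (batch, 270): fixed-size sentence representation
print(z.shape)                            # torch.Size([4, 270])

# An autoregressive decoder could then condition on z, e.g. by projecting it back to
# input_size and feeding it as a length-1 memory of shape (1, batch, input_size),
# and predict the sentence token by token.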