Confusion about the dimensions of a Seq2Seq model

I am new to Seq2Seq and hope to find some proper guidance and advice.

I am doing a project from an online course, so I cannot share the course material, but I have my project notebook on GitHub.

I want to ask about my understanding of the architecture as well as the data dimensions after each layer. Suppose I have a Seq2Seq model as below:

 Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(5678, 512)
    (lstm): LSTM(512, 512, batch_first=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(4297, 512)
    (lstm): LSTM(512, 512, batch_first=True)
    (fc): Linear(in_features=512, out_features=4297, bias=True)
    (dropout): Dropout(p=0.2, inplace=False)
    (softmax): LogSoftmax(dim=1)
  )
)

where 5678 is the source vocabulary size, 512 is the desired embedding size, and 4297 is the target vocabulary size. You can check my Encoder, Decoder, and Seq2Seq classes below:

import random

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Force CPU
# device = "cpu"
print(device)

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        
        super(Encoder, self).__init__()
        
        self.input_size= input_size
        self.hidden_size= hidden_size
        
        self.embedding= nn.Embedding(self.input_size, self.hidden_size)
        self.lstm= nn.LSTM(self.hidden_size, self.hidden_size, batch_first= True)

    def forward(self, i):
        print(i.size())
        embedded= self.embedding(i)
        print(embedded.size())
        o,(h,c)= self.lstm(embedded)
        
        return h, c
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size):
        
        super(Decoder, self).__init__()
        
        self.hidden_size= hidden_size
        self.output_size= output_size

        self.embedding= nn.Embedding(self.output_size, self.hidden_size)
        self.lstm= nn.LSTM(self.hidden_size, self.hidden_size, batch_first= True)
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        self.dropout= nn.Dropout(0.2)
        self.softmax= nn.LogSoftmax(dim= 1)
        
    def forward(self, i, h, c):
        embedded= self.embedding(i)
        o,(h,c)= self.lstm(embedded, (h, c))
        o= self.fc(o[0])
        o= self.dropout(o)
        o= self.softmax(o)
        
        return o, h, c
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size):
        
        super(Seq2Seq, self).__init__()
        
        self.input_size= encoder_input_size
        self.hidden_size= encoder_hidden_size
        self.output_size= decoder_output_size
        
        self.encoder= Encoder(self.input_size, self.hidden_size)
        self.decoder= Decoder(self.hidden_size, self.output_size)
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        output_seq= []
        
        encoder_hidden, encoder_cell= self.encoder(src)
        
        decoder_hidden= encoder_hidden
        decoder_cell= encoder_cell

        decoder_input= torch.Tensor([[target_vocab.token_to_index("<SOS>")]]).long().to(device)
        
        for time_step in range(trg.size(0)):
            output_token, decoder_hidden, decoder_cell= self.decoder(
                decoder_input,
                decoder_hidden,
                decoder_cell
            )
            output_seq.append(output_token)
            
            if self.training:
                if random.random() < teacher_forcing_ratio:
                    decoder_input= trg[time_step]
            else:
                _, top_index= output_token.data.topk(1)
                decoder_input= top_index.squeeze().detach()
        
        return output_seq

My question is: the input size of the Encoder is the source vocabulary size, which suggests that each token in the input sequence should be converted into a one-hot vector before being passed to the Encoder. For example, should a batch have a dimension of (batch_size, seq_len, vocab_size) instead of (batch_size, seq_len)?

I looked at other notebooks and saw that they just pass a batch of (batch_size, seq_len) into the Encoder, and I got confused.

Any help is appreciated.

I have tried passing (batch_size, seq_len, vocab_size), and the Embedding layer output dimension is (batch_size, seq_len, vocab_size, embedding_dim), which makes me even more confused. Shouldn't it be (batch_size, seq_len, embedding_dim)?

Yes, the input for the encoder is (batch_size, seq_len).

Each sequence in a batch is a list/array of integers reflecting the indices of the tokens in the vocabulary. For example, a batch might look like this:

[
    [12, 40, 8, 105, 86, 6],
    [35, 105, 86, 35, 40, 6]
]

representing the two sentences “i like to watch movies .” and “you watch movies you like .”. This means your vocabulary provides a mapping like

{6: ".", 8: "to", 12: "i", 35: "you", 40: "like", ...}

There is no need to convert the tokens / token indices into one-hot vectors. This is what the nn.Embedding layer is for. To clarify, this layer does not create one-hot vectors either, but accepts individual token indices as input.
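
To make this concrete, here is a minimal sketch (using the vocabulary size 5678 and embedding size 512 from the model above) showing that nn.Embedding takes a plain (batch_size, seq_len) tensor of token indices and returns a (batch_size, seq_len, embedding_dim) tensor:

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5678, embedding_dim=512)

# A batch of 2 sequences of length 6, each entry a token index in [0, 5677]
batch = torch.tensor([
    [12, 40, 8, 105, 86, 6],
    [35, 105, 86, 35, 40, 6]
])                              # shape: (batch_size=2, seq_len=6)

embedded = embedding(batch)
print(embedded.size())          # torch.Size([2, 6, 512])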

You only need to appreciate that, say, token index 40 together with a vocabulary size of 5678 carries the same information as a one-hot vector of size 5678 with a 1 at index 40. You can also check out this post.
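
If it helps, here is a quick sanity check of that equivalence (again just a sketch with the sizes from the model above): looking up an index in nn.Embedding gives the same vector as multiplying the corresponding one-hot vector with the embedding's weight matrix:

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(5678, 512)

token_index = torch.tensor([40])                             # plain index lookup
one_hot = F.one_hot(token_index, num_classes=5678).float()   # (1, 5678) one-hot vector

lookup = embedding(token_index)                              # (1, 512)
matmul = one_hot @ embedding.weight                          # (1, 512), same row of the weight matrix
print(torch.allclose(lookup, matmul))                        # True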

Thank you for your reply.

The reason I am confused is that when we implement this layer, we do:
nn.Embedding(vocab_size, embedding_dim)

But when we pass input to this layer, we give it (batch_size, seq_len); there is no vocab_size here. After reading your answer and the post you provided, I guess the reason for giving the Embedding layer vocab_size is to let it know the maximum index range, so that it can create embedding vectors that cover all the indices?

It’s kind of there, but only implicitly :). As mentioned earlier, each sequence in a batch is a list of token indices. These indices are between 0 and vocab_size-1. This is why nn.Embedding needs vocab_size as an input parameter, so it knows what values to expect.
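
One way to see this: the layer allocates a weight matrix with one row per possible index, so vocab_size fixes both the weight shape and the range of valid inputs. A small sketch:

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5678, embedding_dim=512)
print(embedding.weight.size())                    # torch.Size([5678, 512]): one row per token index

print(embedding(torch.tensor([0, 5677])).size())  # fine: valid indices are 0 .. 5677
# embedding(torch.tensor([5678]))                 # would raise an IndexError (index out of range)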

Instead of nn.Embedding you could also use nn.Linear like

self.embedding = nn.Linear(self.input_size, self.hidden_size)

In this case, yes, this would require an input of (batch_size, seq_len, vocab_size), with each token represented by a one-hot vector. The result would be exactly the same! nn.Embedding “only” simplifies this, making it much more convenient in practice for many reasons.
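
For completeness, here is a rough sketch of that equivalence, with the nn.Linear weights tied to the nn.Embedding weights so both layers produce identical outputs (the transpose is needed because nn.Linear stores its weight as (out_features, in_features)):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size = 5678, 512

embedding = nn.Embedding(vocab_size, hidden_size)
linear = nn.Linear(vocab_size, hidden_size, bias=False)

# Tie the weights so both layers use the same parameters
with torch.no_grad():
    linear.weight.copy_(embedding.weight.t())

batch = torch.tensor([[12, 40, 8], [35, 105, 6]])        # (batch_size, seq_len) of token indices
one_hot_batch = F.one_hot(batch, vocab_size).float()     # (batch_size, seq_len, vocab_size)

# Index lookup vs. one-hot matrix multiplication: same result
print(torch.allclose(embedding(batch), linear(one_hot_batch)))   # True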

I’ve actually prepared a detailed Jupyter notebook for my students explaining this with an elaborate example. I hope this makes it fully clear. It’s not quite polished and thus not on GitHub yet.

Thanks for your reply, the tutorial I am currently studying from does not explain this! Got it now.