I am currently working on building a VAE for text.
My problem is that I don't understand how to convert the embedded vectors back to tokens in the decoder, using the same embedding weights/vocabulary as in the encoder.
Some open questions I have:
Do encoder and decoder need to use the same weights of nn.Embedding, i.e. the same mapping tokens -> embeddings? (Current assumption: yes)
Do I even need to transform the embeddings back to tokens in order to get a good reconstruction loss? (Current assumption: yes)
Do I need the mapping anyway if I want to generate new text? (Current assumption: yes)
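To make the first assumption concrete, here is a minimal sketch of what I mean by "same weights": one shared nn.Embedding instance passed to both modules (the Encoder/Decoder classes and sizes are made up for illustration):

```python
import torch.nn as nn

vocab_size, emb_dim = 1000, 32  # made-up sizes

# One shared embedding module, so encoder and decoder
# see the exact same token -> vector mapping.
shared_emb = nn.Embedding(vocab_size, emb_dim)

class Encoder(nn.Module):
    def __init__(self, emb):
        super().__init__()
        self.emb = emb  # same module object as the decoder's

class Decoder(nn.Module):
    def __init__(self, emb):
        super().__init__()
        self.emb = emb

enc, dec = Encoder(shared_emb), Decoder(shared_emb)
# Both reference the very same weight tensor:
assert enc.emb.weight is dec.emb.weight
```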
Let me walk you through the current situation:
First I use a tokenizer which produces a tensor of shape [batch_size, sent_length] full of word tokens.
Now the nn.Embedding(…) layer embeds each token, i.e. transforms the tensor to shape [batch_size, sent_length, emb_dim].
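In code, the shape transformation looks like this (sizes and the random token tensor are just stand-ins for my tokenizer's actual output):

```python
import torch
import torch.nn as nn

batch_size, sent_length, emb_dim = 4, 10, 32  # made-up sizes
vocab_size = 1000

# Stand-in for the tokenizer output: [batch_size, sent_length] of token ids
tokens = torch.randint(0, vocab_size, (batch_size, sent_length))

emb = nn.Embedding(vocab_size, emb_dim)
embedded = emb(tokens)  # [batch_size, sent_length, emb_dim]
assert embedded.shape == (batch_size, sent_length, emb_dim)
```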
(… then other stuff happens)
In the decoder we finally get to the point where we have a tensor of the same shape, i.e. [batch_size, sent_length, emb_dim], and we want to convert it back to tokens.
Is there any smart way of doing this?
My naive/brute-force approach would be:
For each element of the input (each embedded vector [i, j, :]):
iterate through all the embedding weights
(the embeddings of the entire vocabulary stored in nn.Embedding)
and find the matching embedding vector.
(Due to the reconstruction error I suppose there will be no exact match, hence we would need to compare against all vectors in the nn.Embedding layer and pick the one with minimal distance.)
From this vector, we can take its row index and use it as the token
(but only if the ordering in nn.Embedding is the same as in the tokenizer).
If the ordering is not the same, I would need to re-embed each word from the vocabulary and store the result in a table in order to look up the results.
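The loop above can at least be vectorized; here is a sketch of what I have in mind, comparing every decoder output vector against every row of the embedding weight matrix (the decoder output is faked with random values here). Since row i of nn.Embedding's weight is by construction the embedding of token id i, the argmin index would already be the token id, with no extra re-embedding table:

```python
import torch
import torch.nn as nn

batch_size, sent_length, emb_dim, vocab_size = 2, 5, 8, 100  # made-up sizes
emb = nn.Embedding(vocab_size, emb_dim)

# Stand-in for the decoder output: [batch_size, sent_length, emb_dim]
decoded = torch.randn(batch_size, sent_length, emb_dim)

# Distance of every output vector to every vocabulary embedding:
# flatten to [batch*sent, emb_dim], cdist against [vocab_size, emb_dim].
flat = decoded.reshape(-1, emb_dim)
dists = torch.cdist(flat, emb.weight)  # [batch*sent, vocab_size]

# Nearest embedding row per position; the row index IS the token id.
recovered = dists.argmin(dim=-1).reshape(batch_size, sent_length)
```

As a sanity check, feeding exact embeddings through this lookup recovers the original token ids, since each vector has distance zero to its own row.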
The way I implemented this approach seems way too complicated, as the nn.Embedding documentation says that nn.Embedding is:
"A simple lookup table that stores embeddings of a fixed dictionary and size.
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings."
TL;DR: nn.Embedding transforms tokens/indices -> embeddings; I want to transform embeddings -> tokens/indices, using the same nn.Embedding as defined in another layer of the model.
Any help is greatly appreciated!
PS: This is my first post here, so any feedback concerning layout etc. is also welcome.