Hello,

I am currently working on building a **VAE for text**.

My problem is that I don't understand how to **convert the embedded vectors back to tokens** in the decoder, using the same embedding weights/vocabulary as in the encoder.

Some open questions I have:

Do the encoder and decoder need to use the same nn.Embedding weights, i.e. the same mapping of tokens -> embeddings? (Current assumption: yes)

Do I even need to transform the embeddings back to tokens in order to get a good reconstruction loss? (Current assumption: yes)

Do I need the mapping anyway if I want to generate new text? (Current assumption: yes)

Let me walk you through the current situation:

First, I use a tokenizer which produces a tensor of shape [batch_size, sent_length] full of word tokens.

Now the nn.Embedding(…) layer embeds each token, i.e. transforms the tensor to [batch_size, sent_length, emb_dim].
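As a concrete sanity check of the shapes (the vocabulary size, embedding dimension, and batch shape below are toy values I made up):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 64  # toy sizes (assumption)
embedding = nn.Embedding(vocab_size, emb_dim)

tokens = torch.randint(0, vocab_size, (8, 20))  # [batch_size, sent_length]
embedded = embedding(tokens)
print(embedded.shape)  # torch.Size([8, 20, 64])
```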

(… then other stuff happens)

In the decoder we finally get to the point where we have a tensor of the same shape, i.e. [batch_size, sent_length, emb_dim], and we want to convert it back to tokens.

**Is there any smart way of doing this?**

**My naive/brute-force approach would be:**

*For each element of the input* (each embedded vector `[i, j, :]`):

**We want to iterate through all the embedding weights**

(the embeddings of the entire vocabulary from nn.Embedding)

**and find the matching embedding vector.**

(If it exists; due to reconstruction error, I suppose there will be no exact match, so we would need to compare against all the vectors in the nn.Embedding layer and find the one with minimal distance.)

**From this vector, we can take the index and use it as a token.**

(This only works if the ordering in nn.Embedding is the same as in the tokenizer.)

**If the ordering is not the same, I would need to re-embed each word from the vocabulary and store the result in a table in order to look up the results.**
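For what it's worth, the brute-force search described above doesn't need an explicit Python loop or a separate lookup table: row `i` of `nn.Embedding.weight` is by definition the embedding of token index `i`, so a vectorized nearest-neighbor search over the weight matrix directly returns token indices. A minimal sketch (all sizes are toy values I made up, and the decoder output is faked with random noise):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 64        # toy sizes (assumption)
batch_size, sent_length = 8, 20
embedding = nn.Embedding(vocab_size, emb_dim)

# Stand-in for the decoder output: [batch_size, sent_length, emb_dim].
decoded = torch.randn(batch_size, sent_length, emb_dim)

flat = decoded.reshape(-1, emb_dim)               # [batch*sent, emb_dim]
dists = torch.cdist(flat, embedding.weight)       # L2 distance to every vocab row
tokens = dists.argmin(dim=-1)                     # index of closest row = token id
tokens = tokens.reshape(batch_size, sent_length)  # [batch_size, sent_length]
```

Because the row index of the weight matrix *is* the token index, the ordering concern above disappears: no re-embedding table is needed as long as the decoder searches against the same nn.Embedding instance used by the encoder.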

**This approach, the way I implemented it, seems way too complicated, as the** nn.Embedding documentation **says that nn.Embedding is:**

> A simple lookup table that stores embeddings of a fixed dictionary and size.
>
> This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
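That "lookup table" description can be verified directly: the forward pass of nn.Embedding just indexes rows of its weight matrix, which is also why the reverse direction (vector -> token) is a search rather than a lookup. A tiny check (toy sizes assumed):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)  # toy: 10 tokens, 4-dim vectors (assumption)

idx = torch.tensor([3])
# The forward pass returns exactly row 3 of the weight matrix.
assert torch.equal(embedding(idx)[0], embedding.weight[3])
```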

TL;DR: nn.Embedding transforms tokens/indices -> embeddings. I want to transform embeddings -> tokens/indices, using the same nn.Embedding that is already defined in the model, in another layer.

Any help is greatly appreciated!

PS: This is my first post here, so any feedback concerning layout etc. is also welcome.