Tensor and Sequence dataset

Hi everyone!

I have a dataset which consists of pairs of embeddings and strings (sentences). The goal is to learn a mapping between them, so that afterwards I can generate the string from a given embedding, and vice versa. How should I approach this? Any code reference?

Thank you all!

If the string-to-embedding generation comes from a lookup table and it is a bijective dict (every string maps to exactly one embedding and vice versa), you can simply store both directions.
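
A minimal sketch of that lookup-table case, assuming the pairs fit in memory (the variable names and dimensions here are just illustrative):

```python
import torch

# Illustrative pairs; in practice these would come from your dataset.
pairs = [("a sentence", torch.randn(128)),
         ("another sentence", torch.randn(128))]

str_to_emb = {s: e for s, e in pairs}
# Key the reverse map on the raw bytes so equal embeddings compare equal.
emb_to_str = {e.numpy().tobytes(): s for s, e in pairs}

def embed(s: str) -> torch.Tensor:
    return str_to_emb[s]

def decode(e: torch.Tensor) -> str:
    # Exact-match lookup only; for approximate queries you would need
    # a nearest-neighbour search instead.
    return emb_to_str[e.numpy().tobytes()]
```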

Short of that, you are trying to learn the inverse of a function. Imagine the simplest case: the sentence embedding is a real vector generated by averaging the token embeddings. In the reverse task, given a sentence embedding S, you would like to find the tokens whose embeddings, when averaged, produce S. That seems really difficult unless the token embeddings have some specific properties.
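
To make that concrete, here is a tiny example with made-up toy embeddings (not from any real model) showing that two different token sets can average to the same sentence embedding, so the inverse is not even well defined in general:

```python
import torch

# Toy token embeddings chosen so that two different token sets share a mean.
emb = torch.tensor([[1.0, 0.0],   # token 0
                    [0.0, 1.0],   # token 1
                    [1.0, 1.0],   # token 2
                    [0.0, 0.0]])  # token 3

s_a = emb[[0, 1]].mean(dim=0)  # "sentence" made of tokens 0 and 1
s_b = emb[[2, 3]].mean(dim=0)  # "sentence" made of tokens 2 and 3

print(torch.equal(s_a, s_b))   # True: same average, different tokens
```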

This could be a multivariate regression problem, but I can’t imagine the usual constraints for that problem would be satisfied here.


The embeddings are embeddings of single words, and each sequence is that word's definition from a real dictionary. Does that make sense?

The best thing I can think of is using a sort of sequence-to-sequence model.

The training would be in two parts (a rough sketch follows below):

  • Training an encoder to convert strings into their respective encodings
  • Training a decoder to convert the trained encoder's encoding back into a string
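
A rough PyTorch sketch of that two-part idea, under assumptions I am making up for illustration (a fixed 128-dimensional target embedding, a toy word-level vocabulary, and GRUs for both halves; VOCAB_SIZE, EMB_DIM, etc. are placeholders, not values from this thread):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # placeholder vocabulary size
EMB_DIM = 128       # dimensionality of the given embeddings (assumed)
HIDDEN = 256

class Encoder(nn.Module):
    """Encodes a token sequence (e.g. a definition) into a single vector,
    trained to match the given word embedding (e.g. with an MSE loss)."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.proj = nn.Linear(HIDDEN, EMB_DIM)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        _, h = self.rnn(self.tok(tokens))       # h: (1, batch, HIDDEN)
        return self.proj(h.squeeze(0))          # (batch, EMB_DIM)

class Decoder(nn.Module):
    """Generates a token sequence conditioned on an embedding,
    trained with teacher forcing and a cross-entropy loss."""
    def __init__(self):
        super().__init__()
        self.init = nn.Linear(EMB_DIM, HIDDEN)
        self.tok = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, embedding, tgt_tokens):   # embedding: (batch, EMB_DIM)
        h0 = torch.tanh(self.init(embedding)).unsqueeze(0)
        out, _ = self.rnn(self.tok(tgt_tokens), h0)
        return self.out(out)                    # (batch, seq_len, VOCAB_SIZE)

# Toy forward pass, just to show the shapes line up.
enc, dec = Encoder(), Decoder()
tokens = torch.randint(0, VOCAB_SIZE, (4, 12))   # fake definition token ids
given_emb = torch.randn(4, EMB_DIM)              # fake word embeddings

pred_emb = enc(tokens)                           # train with MSE(pred_emb, given_emb)
logits = dec(given_emb, tokens[:, :-1])          # train with CE against tokens[:, 1:]
print(pred_emb.shape, logits.shape)
```

At inference time you would decode greedily or with beam search from a start token, feeding each predicted token back in; that loop is omitted here.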