Tensor and Sequence dataset

Oscar_Garcia · December 17, 2021, 11:23am

Hi everyone!

I have a dataset that consists of pairs of embeddings and strings (sentences). The goal is to learn a mapping between them, so that afterwards I can generate the string from a given embedding, and vice versa. How should I approach this? Is there any code I could use as a reference?

Thank you all!

sagsriv · December 17, 2021, 2:21pm

If there is a lookup table for string-to-embedding generation and it is a bijective dict (every string maps to exactly one embedding and vice versa), it is possible to just store the pairs and look them up in either direction.
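As a rough illustration of the "just store them" case, here is a minimal sketch using two plain Python dictionaries. The data, the names `pairs`, `text_to_emb`, `emb_to_text`, and the helper `lookup_text` are all made up for the example. It only works for exact matches, so it would not help with embeddings that were never seen.

```python
# Hypothetical toy data: each sentence is paired with exactly one embedding.
pairs = [
    ("a small domesticated feline", (0.1, 0.7, -0.2)),
    ("a large body of salt water", (0.9, -0.3, 0.4)),
]

# Tuples are hashable, so an embedding can be used as a dictionary key.
text_to_emb = {text: emb for text, emb in pairs}
emb_to_text = {emb: text for text, emb in pairs}

# The mapping is bijective only if no sentence and no embedding repeats.
assert len(text_to_emb) == len(emb_to_text) == len(pairs)


def lookup_text(embedding):
    """Return the stored sentence for an exact embedding match, else None."""
    return emb_to_text.get(tuple(embedding))
```

For example, `lookup_text([0.1, 0.7, -0.2])` returns the first sentence, while any embedding that is not stored exactly returns `None`.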

Short of that, you are trying to learn the inverse of a function. Imagine the simplest case: the sentence embedding is a real vector, generated by averaging the token embeddings. In the reverse task, given a sentence embedding S, you would like to find the tokens whose embeddings, when averaged, produce S. That seems really difficult unless the token embeddings have some specific properties.
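To make the difficulty concrete, here is a small sketch of the averaging case. The 2-dimensional token vectors in `token_emb` are invented for the example. Two different sentences end up with the same averaged embedding, so no inverse can tell them apart from S alone.

```python
# Hypothetical 2-dimensional token embeddings, for illustration only.
token_emb = {
    "dog": (1.0, 0.0),
    "bites": (0.0, 1.0),
    "man": (1.0, 1.0),
}


def sentence_embedding(tokens):
    """Average the token embeddings component-wise."""
    vectors = [token_emb[t] for t in tokens]
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


# Same tokens, different order: averaging discards word order entirely.
s1 = sentence_embedding(["dog", "bites", "man"])
s2 = sentence_embedding(["man", "bites", "dog"])
print(s1 == s2)
```

Since the two sentences collide, the averaging function is not injective, which is why the inverse is not well defined without extra assumptions.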

This could be a multivariate regression problem, but I can’t imagine the usual constraints for that problem would be satisfied here.

Oscar_Garcia · December 17, 2021, 3:30pm

The embeddings are word embeddings, and each sequence is that word’s definition in a real dictionary. Does that make sense?

Dace · December 17, 2021, 3:49pm

The best thing I can think of is using some sort of sequence-to-sequence model.

The training would be in two parts:

1. Training an encoder to convert strings into their respective encodings
2. Training a decoder to convert the trained encoder’s encodings back into strings
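The decoder half of the idea above could look roughly like the following PyTorch sketch. It is only one possible setup, not a tested recipe: the class name `DefinitionDecoder`, the GRU choice, the layer sizes, and the random toy tensors are all assumptions made for illustration. The given embedding is projected into the initial hidden state of a GRU, which is then trained with teacher forcing to predict the next token of the sentence.

```python
import torch
import torch.nn as nn


class DefinitionDecoder(nn.Module):
    """Predicts sentence tokens conditioned on a fixed-size embedding."""

    def __init__(self, emb_dim, vocab_size, hidden_dim=128):
        super().__init__()
        # Projects the given embedding into the GRU's initial hidden state.
        self.init_hidden = nn.Linear(emb_dim, hidden_dim)
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_embedding, input_tokens):
        # word_embedding: (batch, emb_dim); input_tokens: (batch, seq_len)
        h0 = torch.tanh(self.init_hidden(word_embedding)).unsqueeze(0)
        x = self.token_emb(input_tokens)
        output, _ = self.gru(x, h0)
        return self.out(output)  # (batch, seq_len, vocab_size)


# Toy shapes and random data, standing in for a real dataset.
emb_dim, vocab_size, batch, seq_len = 16, 50, 4, 7
model = DefinitionDecoder(emb_dim, vocab_size)
word_embedding = torch.randn(batch, emb_dim)
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Teacher forcing: feed tokens[:-1] and predict tokens[1:].
logits = model(word_embedding, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

The other direction (string to embedding) could be handled by a separate encoder trained to regress onto the given embeddings, for instance with a mean squared error loss. In practice the token sequences would also need start, end, and padding tokens, which this sketch leaves out.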