Hello all! I'm doing some experiments for my bachelor's thesis. The question is pretty straightforward: I need to embed some floating-point values into a 1024-dimensional vector and concatenate them with other embedded vectors (the usual token embeddings). How can I do that?
Long story short
I'm finetuning a sentence embedder on normal token embeddings plus some floating-point values (each float is between 0 and 30). I'm struggling to find a way to properly encode a float into a 1024-dimensional vector (the model's hidden size). My approach so far has been:
1. Embed all the tokens to get the list of token embeddings (a tensor of shape (n_tokens, 1024)).
2. For every float, create a tensor of shape (1, 1024) that is all zeros except for the last element, i.e. [0, 0, …, f], where f is the float normalized to [0, 1]. I chose this encoding because the Euclidean distance between two such vectors is exactly the distance between the corresponding floats.
3. Manually concatenate the token-embedding tensors and the float-embedding tensors, placing them where I need them, and use the complete embedded tensor as the transformer input.
The transformer should then process these embeddings, add positional embeddings, etc.
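To make the setup concrete, here is a minimal sketch of that pipeline, assuming a Hugging Face encoder with hidden size 1024; the model name, the example text, and the float values are placeholders, not my actual setup:

```python
# Minimal sketch of the pipeline described above, assuming a Hugging Face
# encoder with hidden size 1024; model name, text and floats are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "intfloat/e5-large"  # placeholder: any encoder with hidden size 1024
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
hidden = model.config.hidden_size  # 1024

def embed_float(f: float, max_value: float = 30.0) -> torch.Tensor:
    """Encode one float as a (1, hidden) vector: all zeros except the last
    element, which holds the value normalized to [0, 1]."""
    vec = torch.zeros(1, hidden)
    vec[0, -1] = f / max_value
    return vec

# 1) token embeddings for the text part: (1, n_tokens, hidden)
enc = tokenizer("some input text", return_tensors="pt")
token_embeds = model.get_input_embeddings()(enc["input_ids"])

# 2) float "embeddings": (1, n_floats, hidden)
float_embeds = torch.stack([embed_float(f) for f in [3.2, 17.5]], dim=1)

# 3) concatenate along the sequence dimension and feed the result in via
#    inputs_embeds, letting the model add positional embeddings itself
inputs_embeds = torch.cat([token_embeds, float_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
out = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
print(out.last_hidden_state.shape)  # (1, n_tokens + 2, hidden)
```

For BERT-style encoders, passing inputs_embeds skips the token-embedding lookup but the model still adds its own positional embeddings, which matches the behavior described above.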
This approach does not throw any runtime errors during training and evaluation.
However, to find out whether this system works before starting the long training run, I built a POC: a new transformer model that takes as input 10 floats embedded as described above (each one becomes a tensor of shape (1, 1024), giving a (10, 1024) input tensor) with values between 0.1 and 0.3, sums all the last hidden states, and passes the summed 1024-dimensional vector through a final linear layer that outputs a single float, which should be the minimum of the inputs. It turns out the model does not properly learn to identify the minimum: the error stabilizes around 0.01 because it outputs values close to 0.11, which is roughly the average minimum. As the loss I simply used the absolute difference between the target (the minimum of the inputs) and the output (MSE breaks the model, which collapses to extremely small outputs). I also tried adding a linear layer at the beginning so the model can learn how to encode each single float into a 1024-dimensional vector (no more manual embedding), and I tried filling all 1024 elements with the float value. Every solution I tried fails in the same way. That's everything I've tried so far.
Thanks for the help! I noticed that many people have asked this online, but nobody actually answered, or they were trying to use transformers in the wrong way.