I am a beginner at training machine learning models, and I’d like to combine a one-hot encoding of my input with some features that I hand-selected using domain knowledge, using PyTorch. For example, the one-hot encoding could be of size 100, and I want to add, say, 3 or 4 hand-selected numerical features.
I attempted to do this by simply concatenating the two tensors together, but the results weren’t very good: the loss did go down, but the accuracy was very low. I am looking for any feedback or guidance to point me in the right direction (that is, assuming my current approach won’t work for some reason). The exact encoding of the input can be changed - I have also considered things like Embedding layers, etc.
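Roughly, the concatenation I tried looks like this (simplified; the batch size and feature values are just placeholders):

```python
import torch

# a batch of 8 examples; each input is a one-hot vector of size 100
ids = torch.randint(0, 100, (8,))
one_hot = torch.nn.functional.one_hot(ids, num_classes=100).float()

# 3 hand-selected numerical features per example (placeholder values)
hand_features = torch.randn(8, 3)

# concatenate along the feature dimension
combined = torch.cat([one_hot, hand_features], dim=1)
print(combined.shape)  # torch.Size([8, 103])
```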
Thanks everyone for all the help.
I am guessing one-hot means one-hot per token in your vocabulary, so your feature per token is the concatenation of a one-hot vector and some sort of custom feature for that token. I would probably build a model where you don’t use one-hot, but instead use embeddings for each token (you can use GloVe or fastText or even BERT). This gives you an embedding per token.
You can concatenate your hand features to each of these embeddings to get a larger vector representation, but I don’t know of any rule which guarantees you’ll do much better. BERT, for example, would take into account other tokens around your token, so it might implicitly contain your hand features. Do you have a spelled-out example?
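As a sketch of the embedding-plus-hand-features idea (all dimensions here are made up):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, n_hand = 100, 64, 3
embedding = nn.Embedding(vocab_size, embed_dim)

# a batch of 2 sequences, 5 token ids each
token_ids = torch.randint(0, vocab_size, (2, 5))
# one hand-crafted feature vector per token (placeholder values)
hand_features = torch.randn(2, 5, n_hand)

embedded = embedding(token_ids)                          # (2, 5, 64)
# attach the hand features to each token's embedding
combined = torch.cat([embedded, hand_features], dim=-1)  # (2, 5, 67)
```

The `nn.Embedding` weights are learned with the rest of the model; if you go with GloVe or fastText instead, you would load the pretrained vectors into that layer and optionally freeze them.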
In principle, you can simply concatenate your one-hot vector with additional numerical features. You could also first push your one-hot vectors through 1 or more linear layers and then concatenate this output with your hand-selected features. The sky’s the limit :).
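For instance, a minimal sketch of the "push through a linear layer first" variant (sizes are placeholders):

```python
import torch
import torch.nn as nn

batch, vocab_size, hidden_dim, n_hand = 8, 100, 32, 3

one_hot = torch.nn.functional.one_hot(
    torch.randint(0, vocab_size, (batch,)), num_classes=vocab_size
).float()
hand_features = torch.randn(batch, n_hand)

# compress the one-hot vectors first, then attach the hand features
proj = nn.Linear(vocab_size, hidden_dim)
hidden = torch.relu(proj(one_hot))
combined = torch.cat([hidden, hand_features], dim=1)  # (8, 35)
```

Note that a `Linear` layer applied to one-hot vectors just selects one row of the weight matrix (plus bias), which is essentially what an Embedding layer does.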
Regarding the effects on the results, did you check the accuracy using only the one-hot vectors vs. only the hand-selected features?
On a more fundamental level, the question is also where the one-hot vectors come from. Since you posted in the NLP category, are your inputs words? Then 100 seems kind of small. Or do you work with characters? What is your exact task?
As @dreidizzle suggested, you could always consider an Embedding layer, which kind of mimics the idea above of first pushing the one-hot vectors through 1 or more linear layers. The advantage of embeddings, in general, is that words/tokens with similar semantic meaning (e.g., quick and fast) tend to have similar embeddings. However, in contrast, characters don’t really feature something like semantic similarity (at least not like words). In this case, this advantage of embeddings would not be as pronounced.
Can you give some insights into your training setup and goals?