How to use textual, numerical and categorical features together

l1b3rty · October 13, 2020, 8:05pm

I have been looking for a while for an example of using textual, numerical, and categorical features together, but I couldn’t find one. It would be nice to see a concrete example. Can you help with that?

a_d · October 14, 2020, 11:14am

Hello,
Firstly, it depends on how these pieces of data are correlated, and understanding that would be the first step in incorporating them into your model. Sometimes you may want to introduce some of the data in the later layers of your model, and sometimes you can pass all of the data to your model at once.
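To make the "later layers" idea concrete, here is a minimal PyTorch sketch in which the text features pass through their own layer first and the numerical features are only concatenated in just before the output layer. The class name `LateFusionNet`, the layer sizes, and the random inputs are all hypothetical choices for illustration, not something from the original question.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Hypothetical model: text goes through its own branch, numbers join later."""

    def __init__(self, text_dim=16, num_dim=3, hidden=8, n_out=2):
        super().__init__()
        # Branch that only sees the (already vectorized) text
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Output layer sees the text representation plus the raw numeric features
        self.head = nn.Linear(hidden + num_dim, n_out)

    def forward(self, text_vec, numeric):
        h = self.text_branch(text_vec)           # (batch, hidden)
        fused = torch.cat([h, numeric], dim=1)   # (batch, hidden + num_dim)
        return self.head(fused)                  # (batch, n_out)

model = LateFusionNet()
text_vec = torch.randn(4, 16)  # made-up vectorized text for 4 samples
numeric = torch.randn(4, 3)    # made-up numerical features for 4 samples
out = model(text_vec, numeric)
```

Passing everything at once would instead mean concatenating `text_vec` and `numeric` before the first layer. Which of the two works better depends on the data.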

Secondly, you might want to extract features from your data. Doing so from categorical data is tough, especially because one-hot encoded representations of such data result in sparse tensors, and applying ordinary dense layers to such data is unlikely to give meaningful results. I generally apply one of two strategies. One is to use an embedding layer for the one-hot encoded representation of each categorical field and then concatenate the results. The other is to use sparse convolutions and sparse linear layers. Nvidia has a package called Minkowski Engine, which has sparse implementations of the convolutional operations. You might want to go through the theory first to see which fits your case better.
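The embedding strategy described above could be sketched roughly as follows. Note that `nn.Embedding` in PyTorch takes the integer category index rather than the one-hot vector itself, which is equivalent to multiplying the one-hot vector by a weight matrix. The class name `TabularEncoder`, the field cardinalities, and the sample values are assumptions made up for this example.

```python
import torch
import torch.nn as nn

class TabularEncoder(nn.Module):
    """Hypothetical encoder: one embedding per categorical field, then concatenate."""

    def __init__(self, cardinalities, emb_dim=4, num_numeric=2):
        super().__init__()
        # One embedding table per categorical field
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, emb_dim) for n in cardinalities]
        )
        self.out_dim = emb_dim * len(cardinalities) + num_numeric

    def forward(self, cat_idx, numeric):
        # cat_idx: (batch, n_fields) integer indices; numeric: (batch, num_numeric)
        embedded = [emb(cat_idx[:, i]) for i, emb in enumerate(self.embeddings)]
        # Concatenate all field embeddings with the numerical features
        return torch.cat(embedded + [numeric], dim=1)

# Two made-up categorical fields with 5 and 3 possible values
encoder = TabularEncoder(cardinalities=[5, 3])
cat_idx = torch.tensor([[0, 2], [4, 1], [3, 0]])
numeric = torch.tensor([[0.5, 1.0], [0.1, 0.2], [0.9, 0.3]])
features = encoder(cat_idx, numeric)  # (3, encoder.out_dim)
```

The resulting `features` tensor can be fed to ordinary linear layers, and the embedding tables are trained together with the rest of the model.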

For the text data, load it as pretrained vector representations of the words. Are you working with a standard dataset?
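As a rough illustration of loading text as pretrained word vectors, the sketch below uses `nn.Embedding.from_pretrained`. The vocabulary and the random "pretrained" matrix are placeholders. In practice the matrix would come from real vectors such as GloVe or word2vec, and the whitespace tokenizer here is a deliberate simplification.

```python
import torch
import torch.nn as nn

# Hypothetical tiny vocabulary; index 0 is reserved for padding
vocab = {"<pad>": 0, "<unk>": 1, "good": 2, "bad": 3, "movie": 4}

# Stand-in for a real pretrained matrix of shape (vocab_size, emb_dim)
pretrained = torch.randn(len(vocab), 6)
pretrained[0] = 0.0  # padding token maps to the zero vector

# freeze=True keeps the pretrained vectors fixed during training
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)

def encode(sentence, max_len=4):
    """Map a sentence to a fixed-length tensor of token ids (truncate or pad)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]
    ids = ids[:max_len] + [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

batch = torch.stack([encode("good movie"), encode("bad boring movie")])
vectors = embedding(batch)  # (batch, max_len, emb_dim)
```

Setting `freeze=False` instead would let the word vectors be fine-tuned along with the rest of the model.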

a_d · October 14, 2020, 3:10pm

My previous reply covered how you could include categorical data in your model.
Your processing layers will be part of your model, since they are trainable.
Could you please be more specific about what exactly you are looking for? Where are you getting stuck? It would help if you could share a dummy example or a single datapoint from your dataset.

a_d · October 15, 2020, 9:18am

Gimme some time, I will go through it and get back to you…

a_d · October 16, 2020, 12:09pm

Not yet, I got a little busy, but I assume your main problem is converting textual data to tensors? Am I correct?

a_d · October 16, 2020, 2:12pm

Hello,
I understand the issue now. Would I be right to assume that the numbers denote some weightage for the text? In that case, you could apply a simple element-wise multiplication of these weights (i.e. your numbers) with the vectorized text. Combining these three pieces of information might be done through some operation which relates them. This might be the way to go.
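A minimal sketch of that weighting idea, assuming there is one number per token (if there is only one number per text, it can be broadcast over the whole sequence in the same way). The tensor shapes and weight values are made up for illustration.

```python
import torch

# Made-up vectorized text: (batch, seq_len, emb_dim)
token_vectors = torch.randn(2, 4, 6)

# Made-up weights, one per token: (batch, seq_len)
weights = torch.tensor([[1.0, 0.5, 0.0, 0.0],
                        [2.0, 1.0, 0.5, 0.0]])

# unsqueeze adds a trailing dimension so each weight scales its whole token vector
weighted = token_vectors * weights.unsqueeze(-1)  # (batch, seq_len, emb_dim)

# One possible way to get a single vector per text: sum over the sequence
sentence_vec = weighted.sum(dim=1)                # (batch, emb_dim)
```

A weight of zero removes a token's contribution entirely, so padded positions can simply be given zero weight.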

a_d · October 16, 2020, 2:30pm

Okay, so you have many nodes containing data (like a graph or tree)…
Well, given that you have text, you would be padding the sentences to make them the same length. You might as well pad these numbers with zeros to make them the same length as your sentence, and concatenate them with the embeddings.
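One way to read the padding suggestion is to treat the numbers as an extra feature channel alongside the token embeddings. The sketch below assumes a single sentence of 5 tokens and 2 numbers, with all values invented for the example.

```python
import torch
import torch.nn.functional as F

seq_len, emb_dim = 5, 6

# Made-up embeddings for one padded sentence: (seq_len, emb_dim)
token_vectors = torch.randn(seq_len, emb_dim)

# Made-up numerical values attached to the same sample (fewer than seq_len)
numbers = torch.tensor([3.0, 7.5])

# Right-pad the numbers with zeros up to the sentence length
padded = F.pad(numbers, (0, seq_len - numbers.numel()))  # (seq_len,)

# Append the padded numbers as one extra feature column per token
combined = torch.cat([token_vectors, padded.unsqueeze(1)], dim=1)  # (seq_len, emb_dim + 1)
```

With this layout, any sequence model that consumed the embeddings needs its input size increased by one to accept `combined`.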
On a side note, if your data is structured like a tree, have you tried out graph networks and structured learning?

a_d · October 16, 2020, 2:33pm

Happy to help