How to use textual, numerical and categorical features together

I have been looking for a while for an example that uses textual, numerical and categorical features together, but I couldn’t find one. It would be nice to see a concrete example. Can you help with that?

Firstly, it depends on how these pieces of data relate to each other, and understanding that is the first step towards incorporating them in your model. Sometimes you may want to introduce some of the data in the later layers of your model, and sometimes you can pass all of it to the model at once.
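To make the "later layers" option concrete, here is a minimal sketch, assuming PyTorch. The class name `LateFusionModel`, the layer sizes and the random inputs are all made up for illustration and are not taken from any real dataset. The text is encoded first, and the numeric features are only concatenated right before the final layer.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Hypothetical example: text is encoded first, numeric features join later."""

    def __init__(self, vocab_size=100, text_dim=16, num_numeric=3,
                 hidden=32, num_classes=2):
        super().__init__()
        # EmbeddingBag averages the token embeddings of each sentence.
        self.text_encoder = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
        self.text_layer = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # The head sees the text representation plus the raw numeric features.
        self.head = nn.Linear(hidden + num_numeric, num_classes)

    def forward(self, token_ids, numeric):
        text_repr = self.text_layer(self.text_encoder(token_ids))  # (batch, hidden)
        fused = torch.cat([text_repr, numeric], dim=1)  # (batch, hidden + num_numeric)
        return self.head(fused)

model = LateFusionModel()
token_ids = torch.randint(0, 100, (4, 10))  # 4 sentences, 10 token ids each
numeric = torch.randn(4, 3)                 # 3 numeric features per sample
logits = model(token_ids, numeric)
print(logits.shape)
```

Passing everything "all at once" would instead mean concatenating the numeric features with the text representation before the first dense layer. Which one works better depends on your data.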

Secondly, you might want to extract features from your data. Doing so from categorical data is tough, because one-hot encoded representations of such data are sparse tensors, and applying ordinary dense layers to them often yields meaningless results. I generally apply one of two strategies. One is to use an embedding layer for each categorical field (in place of its one-hot representation) and then concatenate the resulting embeddings. The other is to use sparse convolutions and sparse linear layers. NVIDIA has a package called Minkowski Engine, which provides sparse implementations of convolutional operations. You might want to go through the theory first to see which fits your case better.
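The first strategy could look roughly like the sketch below, assuming PyTorch. `CategoricalEncoder`, the field cardinalities and the embedding size are hypothetical. Note that `nn.Embedding` takes integer category indices rather than one-hot vectors; the lookup is equivalent to multiplying a one-hot vector by the embedding matrix.

```python
import torch
import torch.nn as nn

class CategoricalEncoder(nn.Module):
    """Hypothetical helper: one embedding table per categorical field."""

    def __init__(self, cardinalities, emb_dim=4):
        super().__init__()
        # cardinalities[i] is the number of distinct categories in field i.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, emb_dim) for n in cardinalities]
        )

    def forward(self, cat_ids):
        # cat_ids: (batch, num_fields) tensor of integer category indices.
        parts = [emb(cat_ids[:, i]) for i, emb in enumerate(self.embeddings)]
        # Concatenate the per-field embeddings into one dense vector per sample.
        return torch.cat(parts, dim=1)

encoder = CategoricalEncoder(cardinalities=[5, 3])  # two fields: 5 and 3 categories
cat_ids = torch.tensor([[0, 2], [4, 1], [3, 0]])    # 3 samples
dense = encoder(cat_ids)
print(dense.shape)  # torch.Size([3, 8])
```

The resulting dense tensor can then be concatenated with numeric features or a text representation and fed to ordinary linear layers.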

To load the text data, represent the words with pretrained word vectors. Are you working with a standard dataset?
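As a rough sketch of that idea, assuming PyTorch: the tiny `vocab` and the random `pretrained` matrix below are stand-ins for a real vocabulary and real pretrained vectors (e.g. loaded from a word-vector file), and `encode` is a hypothetical helper that uses naive whitespace tokenization.

```python
import torch
import torch.nn as nn

# Stand-in vocabulary and "pretrained" matrix; in practice these would come
# from a real set of pretrained word vectors.
vocab = {"<pad>": 0, "<unk>": 1, "good": 2, "bad": 3, "movie": 4}
pretrained = torch.randn(len(vocab), 8)
pretrained[0] = 0.0  # keep the padding vector at zero

# freeze=True keeps the vectors fixed during training.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)

def encode(sentence, max_len=5):
    """Map a sentence to a fixed-length tensor of token ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]
    ids = ids[:max_len] + [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

batch = torch.stack([encode("good movie"), encode("bad boring movie")])
vectors = embedding(batch)
print(vectors.shape)  # torch.Size([2, 5, 8])
```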

My previous reply covered how you could include categorical data in your model.
Your processing layers will be part of your model, since they are trainable.
Could you please be more specific about what exactly you are looking for and where you are getting stuck? A dummy example or a single datapoint from your dataset would help.

Gimme some time, I will go through it and get back to you…

Not yet, I got a little busy. But I assume your main problem is converting textual data to tensors? Am I correct?

Hello,
I understand the issue now. Would I be right to assume that the numbers denote some weighting of the text? In that case, a simple element-wise multiplication of the vectorized text by these weights (i.e. your numbers) might be the way to go. More generally, the three pieces of information can be combined through some operation that relates them.
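A minimal sketch of the element-wise weighting, assuming PyTorch and assuming there is one number per token (this is a guess about the data layout; the shapes and values below are invented):

```python
import torch

# Hypothetical shapes: 2 sentences, 5 tokens each, 8-dimensional word vectors.
token_vectors = torch.randn(2, 5, 8)              # (batch, seq_len, emb_dim)
weights = torch.tensor([[1.0, 0.5, 2.0, 0.0, 0.0],
                        [0.3, 1.0, 1.0, 1.5, 0.0]])  # (batch, seq_len)

# unsqueeze adds a trailing dimension so each weight scales its whole token vector.
weighted = token_vectors * weights.unsqueeze(-1)  # (batch, seq_len, emb_dim)
print(weighted.shape)  # torch.Size([2, 5, 8])
```

A weight of zero removes a token's contribution entirely, which may or may not be what you want for padded positions.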

Okay, so you have many nodes containing data (like a graph or tree)…
Well, given that you have text, you would be padding the sentences to make them the same length. You might as well pad these numbers with zeros to the same length as your sentence and concatenate them with the embeddings.
On a side note, if your data is structured like a tree, have you tried graph networks and structured learning?
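The zero-padding suggestion could be sketched like this, assuming PyTorch. The sizes and the two example numbers are made up, and treating the padded numbers as one extra feature column next to the word embeddings is just one possible reading of "concat them with the embeddings".

```python
import torch
import torch.nn.functional as F

seq_len, emb_dim = 6, 8
token_vectors = torch.randn(seq_len, emb_dim)  # embeddings of one padded sentence
numbers = torch.tensor([3.0, 7.5])             # numeric values attached to this node

# Right-pad the numbers with zeros up to the sentence length.
padded = F.pad(numbers, (0, seq_len - numbers.numel()))  # (seq_len,)

# Append the padded numbers as an extra feature column.
combined = torch.cat([token_vectors, padded.unsqueeze(1)], dim=1)
print(combined.shape)  # torch.Size([6, 9])
```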

Happy to help :)
