My features are a mix of univalent/multivalent, dense & sparse categorical strings, and univalent/multivalent, dense & sparse categorical ints. I need to convert the categorical string features to integer ids which can then be used for embedding lookup. What I am trying to do is the Keras equivalent of:
```python
hashing_layer = tf.keras.layers.Hashing(vocab_size)(feature)
embedding = tf.keras.layers.Embedding(vocab_size, 2, sparse=True)(hashing_layer)
embedding = tf.reduce_mean(embedding, axis=1)
```
e.g., for a batch size of 3, where the second example has an empty feature:

```python
raw_feature    = [['z'], [], ['aa', 'bb']]
hashing_layer  = [[…], [], [72, 82]]
embedding      = [[[0.019482743, 0.03436062]],
                  [],
                  [[-0.01437955, 0.042842973], [-0.007321369, -0.049441278]]]
input_to_model = [[ 0.01948274,  0.03436062],
                  [ 0.        ,  0.        ],
                  [-0.01085046, -0.00329915]]
```
What's the PyTorch way of implementing this? Should I compute the hash ids inside the `Dataset` class and implement the embedding (maybe `EmbeddingBag`?) inside the model? And what's the PyTorch way of implementing the hash ids (hashing + mod)? I found `sigrid_hash`, but it requires the input to be an integer to begin with.
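For concreteness, here is a minimal sketch of the approach I have in mind (not a working solution): hash strings to bucket ids myself with `hashlib` (Python's built-in `hash()` is salted per process, so it isn't stable across runs), then mean-pool with `nn.EmbeddingBag` using `offsets` to handle variable-length and empty features. The `hash_bucket` helper and the batch layout are my own assumptions, not from any library.

```python
import hashlib

import torch
import torch.nn as nn

def hash_bucket(s: str, vocab_size: int) -> int:
    # Hypothetical helper: deterministic string -> bucket id (hash + mod).
    # md5 is stable across processes, unlike Python's salted built-in hash().
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % vocab_size

vocab_size = 100
emb = nn.EmbeddingBag(vocab_size, 2, mode="mean")  # mean-pools each "bag"

# Multivalent string feature for a batch of 3; one row is empty.
batch = [['z'], [], ['aa', 'bb']]

# Flatten to one 1-D id tensor plus per-row start offsets.
ids = torch.tensor([hash_bucket(s, vocab_size) for row in batch for s in row])
offsets = torch.tensor([0] + [len(row) for row in batch]).cumsum(0)[:-1]

pooled = emb(ids, offsets)  # shape (3, 2); empty bags come back as zeros
```

Per the `EmbeddingBag` docs, empty bags produce zero vectors, which matches the zeroed second row in my Keras example above. Is this the idiomatic way, or should the hashing live in the `Dataset`/collate function instead?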