My features are a mix of univalent/multivalent, dense & sparse categorical strings, and univalent/multivalent, dense & sparse categorical ints. I need to convert the categorical string features to integer ids which can then be used for embedding lookup. What I am trying to do is the Keras equivalent of:
```python
hashing_layer = tf.keras.layers.Hashing(vocab_size)(feature)
embedding = tf.keras.layers.Embedding(vocab_size, 2, sparse=True)(hashing_layer)
embedding = tf.reduce_mean(embedding, axis=1)
```
e.g., for a batch size of 3, where the second example has an empty feature:

```python
raw_feature    = [['z'], [], ['aa', 'bb']]
hashing_layer  = [[…], [], [72, 82]]
embedding      = [[[0.019482743, 0.03436062]],
                  [],
                  [[-0.01437955, 0.042842973], [-0.007321369, -0.049441278]]]
input_to_model = [[ 0.01948274,  0.03436062],
                  [ 0.        ,  0.        ],
                  [-0.01085046, -0.00329915]]
```
What's the PyTorch way of implementing this? Should I compute the hash ids inside the `Dataset` class and implement the embedding (maybe `EmbeddingBag`?) inside the model? And what's the PyTorch way of implementing the hash ids (hashing + mod)? I found `sigrid_hash`, but it requires the input to be an integer to begin with.
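For concreteness, here is a minimal sketch of the approach I have in mind (not a working solution): hash strings to bucket ids myself with `hashlib` (Python's built-in `hash()` is salted per process, so it isn't stable across runs), then mean-pool with `nn.EmbeddingBag` using `offsets` to handle variable-length and empty features. The `hash_bucket` helper and the batch layout are my own assumptions, not from any library.

```python
import hashlib

import torch
import torch.nn as nn

def hash_bucket(s: str, vocab_size: int) -> int:
    # Hypothetical helper: deterministic string -> bucket id (hash + mod).
    # md5 is stable across processes, unlike Python's salted built-in hash().
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16) % vocab_size

vocab_size = 100
emb = nn.EmbeddingBag(vocab_size, 2, mode="mean")  # mean-pools each "bag"

# Multivalent string feature for a batch of 3; one row is empty.
batch = [['z'], [], ['aa', 'bb']]

# Flatten to one 1-D id tensor plus per-row start offsets.
ids = torch.tensor([hash_bucket(s, vocab_size) for row in batch for s in row])
offsets = torch.tensor([0] + [len(row) for row in batch]).cumsum(0)[:-1]

pooled = emb(ids, offsets)  # shape (3, 2); empty bags come back as zeros
```

Per the `EmbeddingBag` docs, empty bags produce zero vectors, which matches the zeroed second row in my Keras example above. Is this the idiomatic way, or should the hashing live in the `Dataset`/collate function instead?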