In my model, I want to replace the LSTM-based char embedding with a CNN-based char embedding. How should I modify the data tensor efficiently for this? I have a tensor of shape [batch_size, max_seq_length], and I want to convert it to [batch_size, max_word_length, max_char_length].
For example, consider the two sequences ["I love mango", "I eat mango"] with batch_size=2. For the LSTM-based embedding, this is encoded as a [batch_size, max_seq_length] tensor as follows, where the space token is represented by 50 and the padding token by 51:
data = [[1,50,21,24,11,3,50,15,14,16,43,12,51], [1,50,3,14,4,50,15,14,16,43,12,51,51]]
In addition, I have a matrix that stores the position where each word ends in its sequence (the index of the space token after it, or of the first padding token for the last word):
splits = [[1,6,12],[1,5,11]]
How should I use this splits matrix to efficiently find the maximum character length (max_char_length) over the complete batch? In this case, it is 5, the length of "mango".
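For reference, here is a minimal sketch of how that maximum could be computed with tensor operations, assuming PyTorch (the framework is not fixed here) and assuming splits is rectangular, i.e. every sequence has the same number of words, as in this example:

import torch

splits = torch.tensor([[1, 6, 12], [1, 5, 11]])

# A word's length is the gap between consecutive boundary positions,
# minus one for the boundary token itself; prepending -1 makes the
# first word be measured from the start of the sequence.
prev = torch.cat([torch.full((splits.size(0), 1), -1, dtype=splits.dtype),
                  splits[:, :-1]], dim=1)
word_lengths = splits - prev - 1             # [[1, 4, 5], [1, 3, 5]]
max_char_length = word_lengths.max().item()  # 5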
Finally, the expected conversion of data to [batch_size, max_word_length, max_char_length] is:
data_modified = [[[1,51,51,51,51],[21,24,11,3,51],[15,14,16,43,12]],[[1,51,51,51,51],[3,14,4,51,51],[15,14,16,43,12]]]
This can be done with loops, but that would slow down training drastically. Can it be done more efficiently?
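To make the baseline concrete, here is a minimal sketch of the loop-based version I mean (again assuming PyTorch; split_to_words and pad are names I made up for illustration, and it assumes every sequence has the same number of words, as in this example, otherwise the word dimension would also have to be padded to max_word_length):

import torch

def split_to_words(data, splits, pad=51):
    # Loop-based baseline: slice each sequence at its word-boundary
    # positions, then pad every word to max_char_length.
    batch = []
    for seq, bounds in zip(data.tolist(), splits.tolist()):
        words, start = [], 0
        for end in bounds:
            words.append(seq[start:end])
            start = end + 1  # skip the boundary (space/padding) token
        batch.append(words)
    max_char_length = max(len(w) for words in batch for w in words)
    return torch.tensor([[w + [pad] * (max_char_length - len(w)) for w in words]
                         for words in batch])

data = torch.tensor([[1,50,21,24,11,3,50,15,14,16,43,12,51],
                     [1,50,3,14,4,50,15,14,16,43,12,51,51]])
splits = torch.tensor([[1,6,12],[1,5,11]])
print(split_to_words(data, splits))  # matches data_modified above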