In my model, I want to replace an LSTM-based character embedding with a CNN-based character embedding. How should I modify the data tensor efficiently for this?
I have a tensor of shape [batch_size, max_seq_length] and I want to convert it to [batch_size, max_word_length, max_char_length].
For example, consider the two sequences [I love mango, I eat mango] with batch_size=2. For the LSTM-based embedding, this batch is encoded as a [batch_size, max_seq_length] tensor as follows, where a space is represented by 50 and the padding token by 51:

data = [[1,50,21,24,11,3,50,15,14,16,43,12,51], [1,50,3,14,4,50,15,14,16,43,12,51,51]]
In addition, I have a matrix that stores the positions where words end in each sequence (the position of the space after each word, or of the padding token after the last word):

splits = [[1,6,12],[1,5,11]]
How should I use this splits matrix to efficiently find the maximum character length across the complete batch? In this case, it is 5 (the length of "mango").
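Here is a minimal sketch of one way to do this in PyTorch, assuming splits holds, for each sequence, the index of the delimiter that follows each word, as in the example above. The length of the first word is the position of the first delimiter; each later word's length is the gap between consecutive delimiters minus one:

import torch

splits = torch.tensor([[1, 6, 12], [1, 5, 11]])

# length of the first word = position of the first delimiter
first = splits[:, :1]
# lengths of the remaining words = gap between consecutive delimiters,
# minus one for the delimiter itself
rest = splits[:, 1:] - splits[:, :-1] - 1
word_lens = torch.cat([first, rest], dim=1)  # [[1, 4, 5], [1, 3, 5]]

max_char_length = word_lens.max().item()     # 5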
Finally, the expected conversion of data to [batch_size, max_word_length, max_char_length] is:

data_modified = [[[1,51,51,51,51],[21,24,11,3,51],[15,14,16,43,12]],[[1,51,51,51,51],[3,14,4,51,51],[15,14,16,43,12]]]
This can be done with Python loops, but that would slow down training drastically. Can it be done more efficiently?
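One loop-free sketch, again in PyTorch and again assuming the splits layout from the example (every sequence padded to the same number of words); the function name words_to_chars and the pad_id argument are illustrative, not from any library. It computes each word's start offset and length from splits, builds a [batch, word, char] grid of source indices, gathers all characters in a single call, and masks the out-of-range slots with the padding id:

import torch

def words_to_chars(data, splits, pad_id):
    B, L = data.shape
    # word lengths, computed from splits as above
    lens = torch.cat([splits[:, :1], splits[:, 1:] - splits[:, :-1] - 1], dim=1)  # [B, W]
    C = int(lens.max())  # max_char_length over the whole batch
    # each word starts one position after the previous delimiter
    starts = torch.cat([torch.zeros_like(splits[:, :1]), splits[:, :-1] + 1], dim=1)  # [B, W]
    # [B, W, C] grid of flat source indices: word start + character offset
    pos = torch.arange(C, device=data.device).view(1, 1, C)
    idx = (starts.unsqueeze(-1) + pos).clamp(max=L - 1)  # keep gather in bounds
    valid = pos < lens.unsqueeze(-1)                     # True at real character positions
    out = data.gather(1, idx.view(B, -1)).view(B, lens.shape[1], C)
    return out.masked_fill(~valid, pad_id)

data = torch.tensor([[1, 50, 21, 24, 11, 3, 50, 15, 14, 16, 43, 12, 51],
                     [1, 50, 3, 14, 4, 50, 15, 14, 16, 43, 12, 51, 51]])
splits = torch.tensor([[1, 6, 12], [1, 5, 11]])
print(words_to_chars(data, splits, pad_id=51))
# tensor([[[ 1, 51, 51, 51, 51],
#          [21, 24, 11,  3, 51],
#          [15, 14, 16, 43, 12]],
#         [[ 1, 51, 51, 51, 51],
#          [ 3, 14,  4, 51, 51],
#          [15, 14, 16, 43, 12]]])

Everything here is a handful of tensor ops, so it runs on the GPU without per-word Python overhead; on the example batch it reproduces data_modified exactly.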