In my model, I want to replace the LSTM-based char embedding with a CNN-based char embedding. How should I modify the data tensor efficiently for this? I have a tensor of shape [batch_size, max_seq_length], and I want to convert it to [batch_size, max_word_length, max_char_length].
For example, consider the two sequences ["I love mango", "I eat mango"] with batch_size=2. For the LSTM-based embedding, this is encoded as a [batch_size, max_seq_length] tensor as follows, where the space token is represented by 50 and the padding token by 51:
data = [[1,50,21,24,11,3,50,15,14,16,43,12,51], [1,50,3,14,4,50,15,14,16,43,12,51,51]]
In addition, I have a matrix that stores the position where each word ends in its sequence (the index of the space token after it, or of the first padding token for the last word):
splits = [[1,6,12],[1,5,11]]
How should I use this splits matrix to efficiently find the maximum character length (max_char_length) over the complete batch? In this case, it is 5, the length of "mango".
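For reference, here is a minimal sketch of how that maximum could be computed with tensor operations, assuming PyTorch (the framework is not fixed here) and assuming splits is rectangular, i.e. every sequence has the same number of words, as in this example:

import torch

splits = torch.tensor([[1, 6, 12], [1, 5, 11]])

# A word's length is the gap between consecutive boundary positions,
# minus one for the boundary token itself; prepending -1 makes the
# first word be measured from the start of the sequence.
prev = torch.cat([torch.full((splits.size(0), 1), -1, dtype=splits.dtype),
                  splits[:, :-1]], dim=1)
word_lengths = splits - prev - 1             # [[1, 4, 5], [1, 3, 5]]
max_char_length = word_lengths.max().item()  # 5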
Finally, the expected conversion of data to [batch_size, max_word_length, max_char_length] is:
data_modified = [[[1,51,51,51,51],[21,24,11,3,51],[15,14,16,43,12]],[[1,51,51,51,51],[3,14,4,51,51],[15,14,16,43,12]]]
This can be done with loops, but that would slow down training drastically. Can it be done more efficiently?
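To make the baseline concrete, here is a minimal sketch of the loop-based version I mean (again assuming PyTorch; split_to_words and pad are names I made up for illustration, and it assumes every sequence has the same number of words, as in this example, otherwise the word dimension would also have to be padded to max_word_length):

import torch

def split_to_words(data, splits, pad=51):
    # Loop-based baseline: slice each sequence at its word-boundary
    # positions, then pad every word to max_char_length.
    batch = []
    for seq, bounds in zip(data.tolist(), splits.tolist()):
        words, start = [], 0
        for end in bounds:
            words.append(seq[start:end])
            start = end + 1  # skip the boundary (space/padding) token
        batch.append(words)
    max_char_length = max(len(w) for words in batch for w in words)
    return torch.tensor([[w + [pad] * (max_char_length - len(w)) for w in words]
                         for words in batch])

data = torch.tensor([[1,50,21,24,11,3,50,15,14,16,43,12,51],
                     [1,50,3,14,4,50,15,14,16,43,12,51,51]])
splits = torch.tensor([[1,6,12],[1,5,11]])
print(split_to_words(data, splits))  # matches data_modified above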