I am trying to average subword embeddings to form a word-level representation. Each word has a corresponding start and end index, indicating which subwords make up that word.
sequence_output is a tensor of shape B * 384 * 768, where 384 is the max sequence length and 768 is the number of features.
all_token_mapping is a tensor of shape B * 384 * 2, holding a start and an end index for each word. It is padded with [-1, -1].
initial_reps is a tensor of shape num_nodes * 768, where num_nodes is the total number of words (not subwords) across all samples.
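For concreteness, a dummy setup matching these shapes could look like the following (B, the spans, and the per-sample word counts are made-up example values):

import torch

B, max_len, hidden = 2, 384, 768
sequence_output = torch.randn(B, max_len, hidden)

# One (start, end) subword span per word, padded with [-1, -1] after the last word.
all_token_mapping = torch.full((B, max_len, 2), -1, dtype=torch.long)
all_token_mapping[0, :3] = torch.tensor([[0, 1], [2, 2], [3, 5]])  # sample 0: 3 words
all_token_mapping[1, :2] = torch.tensor([[0, 0], [1, 3]])          # sample 1: 2 words

num_nodes = 3 + 2  # total number of words across both samples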
initial_reps = torch.empty((num_nodes, 768), dtype=torch.float32)
current_idx = 0
for i, feature_tokens_mapping in enumerate(all_token_mapping):
    for j, token_mapping in enumerate(feature_tokens_mapping):
        if token_mapping[0] == -1:  # reached the [-1, -1] padding for this sequence
            break
        # average the subword vectors spanning word j of sample i
        initial_reps[current_idx] = torch.mean(
            sequence_output[i][token_mapping[0]:token_mapping[1] + 1], 0)
        current_idx += 1
My current code creates an empty tensor with num_nodes rows, and the nested loop fills each row by using token_mapping[0] and token_mapping[1] to select the slice of sequence_output to average.
Is there a way to vectorize this code?
In addition, I have a list that holds the number of words in each sample, i.e. the sum of all the elements in the list equals num_nodes.
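One direction that might work (an untested sketch on my side, so I am not sure it is correct) is to flatten all valid (start, end) spans across the batch, expand them into per-subword indices, and then sum per word with index_add_ before dividing by the word lengths. It assumes the [-1, -1] padding only appears after the last real word of each sample:

# Untested sketch of a loop-free version.
starts = all_token_mapping[..., 0]                      # (B, 384)
ends = all_token_mapping[..., 1]                        # (B, 384)
valid = starts != -1                                    # mask of real (non-padded) words

batch_of_word = torch.arange(starts.size(0)).unsqueeze(1).expand_as(starts)[valid]  # (num_nodes,)
starts = starts[valid]                                  # (num_nodes,)
ends = ends[valid]
lengths = ends - starts + 1                             # subwords per word

# Repeat each word index and sample index once per subword it covers,
# and rebuild the subword positions with a cumulative-offset trick.
word_of_subword = torch.repeat_interleave(torch.arange(starts.numel()), lengths)
sample_of_subword = torch.repeat_interleave(batch_of_word, lengths)
offsets = torch.arange(word_of_subword.numel()) - torch.repeat_interleave(lengths.cumsum(0) - lengths, lengths)
pos_of_subword = torch.repeat_interleave(starts, lengths) + offsets

# Sum the subword vectors belonging to each word, then divide by its length.
initial_reps = torch.zeros(starts.numel(), 768, dtype=sequence_output.dtype)
initial_reps.index_add_(0, word_of_subword, sequence_output[sample_of_subword, pos_of_subword])
initial_reps /= lengths.unsqueeze(1).to(initial_reps.dtype)

If something like this is on the right track, I would still appreciate a cleaner or more standard way to do it.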
Thank you.