I am trying to average subword embeddings to form a word-level representation. Each word has a corresponding start and end index, indicating which subwords make up that word.
sequence_output is a tensor of shape B x 384 x 768, where 384 is the max sequence length and 768 is the number of features.
all_token_mapping is a tensor of shape B x 384 x 2, which holds the start and end subword indices for each word. It is padded with [-1, -1].
initial_reps is a tensor of shape num_nodes x 768, where num_nodes is the total number of words (not subwords) across all samples.
```python
initial_reps = torch.empty((num_nodes, 768), dtype=torch.float32)
current_idx = 0
for i, feature_tokens_mapping in enumerate(all_token_mapping):
    for token_mapping in feature_tokens_mapping:
        if token_mapping[0] == -1:  # reached the padding for this sequence
            break
        # average the subword vectors in the span [start, end]
        initial_reps[current_idx] = torch.mean(
            sequence_output[i][token_mapping[0]:token_mapping[1] + 1], 0
        )
        current_idx += 1
```
My current code creates an empty tensor of length num_nodes, and a for loop fills in the value at each index by using token_mapping[0] and token_mapping[1] to select the correct slice of sequence_output to average.
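For reference, here is a self-contained toy version of the loop (the sizes 6 and 4 below stand in for 384 and 768, and the mappings are made up for illustration):

```python
import torch

torch.manual_seed(0)
B, L, H = 2, 6, 4  # toy stand-ins for B, 384, 768
sequence_output = torch.randn(B, L, H)
# [start, end] subword indices per word, padded with [-1, -1]
all_token_mapping = torch.tensor([
    [[0, 1], [2, 2], [3, 5], [-1, -1], [-1, -1], [-1, -1]],
    [[0, 0], [1, 3], [-1, -1], [-1, -1], [-1, -1], [-1, -1]],
])
num_nodes = int((all_token_mapping[:, :, 0] >= 0).sum())  # 5 words total

initial_reps = torch.empty((num_nodes, H), dtype=torch.float32)
current_idx = 0
for i, feature_tokens_mapping in enumerate(all_token_mapping):
    for start, end in feature_tokens_mapping.tolist():
        if start == -1:  # reached the padding for this sequence
            break
        initial_reps[current_idx] = sequence_output[i, start:end + 1].mean(0)
        current_idx += 1

print(initial_reps.shape)  # torch.Size([5, 4])
```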
Is there a way to vectorize this code?
In addition, I have a list holding the number of words in each sample, so the sum of its elements equals num_nodes.
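For what it's worth, here is a sketch of one way the loop could be vectorized, building a per-word averaging weight matrix over subword positions and contracting it with the gathered sequence output (toy sizes again stand in for 384 and 768; this is one possible approach, not the only one):

```python
import torch

torch.manual_seed(0)
B, L, H = 2, 6, 4  # toy stand-ins for B, 384, 768
sequence_output = torch.randn(B, L, H)
all_token_mapping = torch.tensor([
    [[0, 1], [2, 2], [3, 5], [-1, -1], [-1, -1], [-1, -1]],
    [[0, 0], [1, 3], [-1, -1], [-1, -1], [-1, -1], [-1, -1]],
])

mask = all_token_mapping[:, :, 0] >= 0                 # B x L, valid (non-padding) words
starts = all_token_mapping[..., 0][mask]               # (num_nodes,)
ends = all_token_mapping[..., 1][mask]                 # (num_nodes,)
batch_idx = torch.arange(B).unsqueeze(1).expand(B, L)[mask]  # sample of each word

# num_nodes x L weights: 1/span_length inside each word's subword span, 0 outside
positions = torch.arange(L)
span = (positions >= starts.unsqueeze(1)) & (positions <= ends.unsqueeze(1))
weights = span.float() / span.float().sum(dim=1, keepdim=True)

# weighted sum over subword positions == mean over each span
initial_reps = torch.einsum('nl,nlh->nh', weights, sequence_output[batch_idx])
```

This trades the Python loop for one dense num_nodes x L weight matrix, which is cheap at L = 384 and keeps everything on the GPU.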