Hi all! I’m fairly new to PyTorch and still learning how Autograd works.
I am experimenting with fusing different word embeddings to feed into a neural network, and I’m not sure whether the operations I’m doing will be tracked correctly and differentiated as intended.
Let me explain. I am building an embedding fusion layer that is meant to behave like a TimeDistributedDense layer in Keras: it takes multiple embeddings for each token and fuses them by applying the same Linear transformation to each one. However, I do not want to apply the fusion to padding tokens, whose embeddings are always all zeros, so I skip them like this:
def _apply_to_nonzero(self, x):
    # x: e.g. (batch, seq_len, emb_dim); padding tokens are all-zero vectors.
    # Boolean mask of non-padding positions: True wherever the embedding
    # has at least one non-zero entry.
    non_zeros = (x != 0).any(dim=-1)
    # Zero matrix for the batch, using the output size of the Linear layer
    # (self.output_size), on the same device/dtype as the input.
    all_zeros = torch.zeros((*x.size()[:-1], self.output_size),
                            device=x.device, dtype=x.dtype)
    # Fill the non-padding positions of the zero matrix with the fused embeddings.
    all_zeros[non_zeros] = self._fuse(x[non_zeros])
    # At this point, all_zeros contains the fused embeddings at non-padding
    # positions and zero vectors of length self.output_size at padding positions.
    return all_zeros
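For context, here is a minimal standalone sketch of the same pattern with made-up shapes (the Linear layer is just a stand-in for my actual _fuse, and I’m assuming x has shape (batch, seq_len, emb_dim)):

import torch
import torch.nn as nn

fuse = nn.Linear(8, 4)            # stand-in for self._fuse
x = torch.randn(2, 5, 8)          # (batch, seq_len, emb_dim)
x[0, 3:] = 0.0                    # last two tokens of sequence 0 are padding

mask = (x != 0).any(dim=-1)       # (batch, seq_len), True at non-padding tokens
out = torch.zeros(2, 5, 4)        # (batch, seq_len, self.output_size)
out[mask] = fuse(x[mask])         # fuse only the non-padding embeddings
print(out[0, 3:])                 # padding rows stay all-zero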
The code above runs as intended, but will these operations be differentiated correctly? (What concerns me is that I’m slicing with a boolean mask and writing those slices into a different tensor in place…)
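This is the small check I ran to see whether gradients flow back through the masked assignment (again with a stand-in Linear for _fuse). It behaves as I’d hope, but I want to be sure I’m not missing something subtle:

import torch
import torch.nn as nn

fuse = nn.Linear(8, 4)            # stand-in for self._fuse
x = torch.randn(2, 5, 8)
x[1, 2:] = 0.0                    # padding tokens in sequence 1
x.requires_grad_(True)

mask = (x != 0).any(dim=-1)
out = torch.zeros(2, 5, 4)
out[mask] = fuse(x[mask])
out.sum().backward()

print(x.grad[1, 2:].abs().sum())  # 0.0: padding positions receive no gradient
print(x.grad[0].abs().sum())      # should be > 0: non-padding tokens get gradients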