Can I split my input into multiple embeddings? How would PyTorch compute gradients?

Context: I’m learning PyTorch and trying to predict the next character given the previous character (Shakespeare text). I’m aware that there are smarter ways to predict things, but this is a good toy example to learn the ins and outs of PyTorch.

At first, there is just 1 embedding layer (number of chars as input, number of chars as output) and a loss function. Training works: the loss function decreases with epochs.

def __init__(self):
  super().__init__()
  # vocab_size = len(set(shakespeare_chars)) = 65
  self.embed = nn.Embedding(vocab_size, vocab_size)

def forward(self, x):
  return self.embed(x)
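
Here is roughly how I train it (a minimal sketch; the class name, batch size and optimizer are just placeholders). The embedding output is treated directly as next-character logits and nn.CrossEntropyLoss drives the updates:

import torch
import torch.nn as nn

vocab_size = 65

class OneCharModel(nn.Module):  # placeholder name, mirrors the module above
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x):
        return self.embed(x)

model = OneCharModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

xb = torch.randint(0, vocab_size, (32,))  # batch of current-character indices
yb = torch.randint(0, vocab_size, (32,))  # batch of next-character indices

logits = model(xb)             # (32, vocab_size)
loss = criterion(logits, yb)
optimizer.zero_grad()
loss.backward()                # autograd fills .grad for the embedding weights
optimizer.step()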

Now I would like to predict the next character based on the previous 2 characters.
So my input is a tensor with 2 values. I aggregate the two embeddings into a larger tensor with torch.cat and pass that into a linear prediction layer.
I do it like so:

def __init__(self):
  super().__init__()
  # 10 is a totally "random" embedding size
  self.embed = nn.Embedding(vocab_size, 10)
  self.linear = nn.Linear(2 * 10, vocab_size)

def forward(self, x):
  e1 = self.embed(x[0])            # embedding of the first context character
  e2 = self.embed(x[1])            # embedding of the second context character
  e = torch.cat([e1, e2], dim=1)   # concatenate along the feature dimension
  return self.linear(e)
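
To make the shapes concrete, here is a small sketch of that forward pass, assuming x is stacked as a (2, batch) tensor of character indices (the batch layout is just my assumption):

import torch
import torch.nn as nn

vocab_size = 65
embed = nn.Embedding(vocab_size, 10)
linear = nn.Linear(2 * 10, vocab_size)

x = torch.randint(0, vocab_size, (2, 32))  # 2 context characters x batch of 32

e1 = embed(x[0])                 # (32, 10)
e2 = embed(x[1])                 # (32, 10)
e = torch.cat([e1, e2], dim=1)   # (32, 20): dim=1 is the feature dimension
out = linear(e)                  # (32, vocab_size) logits for the next character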

Now my question is:
given that I’m not “chaining” layers explicitly (e.g. with nn.Sequential), and that I’m “cutting” my input (x[0], x[1]) and recombining it with torch.cat, will gradients be computed correctly?
I’ve read about autograd, but it’s unclear to me how it infers the graph and whether “everything” is accepted.
Thanks!

Yes, gradients will be computed correctly: indexing and torch.cat are both differentiable operations. Autograd builds the graph dynamically from whatever operations run in your forward pass, so explicit chaining with nn.Sequential is not required.
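
As a quick sanity check (a standalone sketch with illustrative shapes, not code from the thread): index an embedding twice, concatenate, run backward, and confirm the embedding table received a gradient.

import torch
import torch.nn as nn

embed = nn.Embedding(65, 10)
x = torch.randint(0, 65, (2, 4))  # 2 context characters x batch of 4

e = torch.cat([embed(x[0]), embed(x[1])], dim=1)  # (4, 20)
e.sum().backward()                # backprop through the cat and the two indexing ops

print(embed.weight.grad is not None)  # True: the gradient reached the embedding weights

Only the rows of embed.weight.grad whose indices actually appear in x will be non-zero; the rest stay at zero.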
