Can I split my input into multiple embeddings? How would PyTorch compute gradients?

Context: I’m learning PyTorch and trying to predict the next character given the previous character (Shakespeare text). I’m aware that there are smarter ways to predict things, but this is a good toy example to learn the ins and outs of PyTorch.

At first, there is just 1 embedding layer (number of chars as input, number of chars as output) and a loss function. Training works: the loss function decreases with epochs.

def __init__(self):
  super().__init__()
  # vocab_size = len(set(shakespeare_chars)) = 65
  self.embed = nn.Embedding(vocab_size, vocab_size)

def forward(self, x):
  return self.embed(x)
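
Here is roughly how I train it (a minimal sketch; the class name, batch size and optimizer are just placeholders). The embedding output is treated directly as next-character logits and nn.CrossEntropyLoss drives the updates:

import torch
import torch.nn as nn

vocab_size = 65

class OneCharModel(nn.Module):  # placeholder name, mirrors the module above
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x):
        return self.embed(x)

model = OneCharModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

xb = torch.randint(0, vocab_size, (32,))  # batch of current-character indices
yb = torch.randint(0, vocab_size, (32,))  # batch of next-character indices

logits = model(xb)             # (32, vocab_size)
loss = criterion(logits, yb)
optimizer.zero_grad()
loss.backward()                # autograd fills .grad for the embedding weights
optimizer.step()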

Now I would like to predict the next character based on the previous 2 characters.
So my input is a tensor with 2 values. I aggregate the two embeddings into a larger tensor with torch.cat and pass that into a linear prediction layer.
I do it like so:

def __init__(self):
  super().__init__()
  # 10 is a totally "random" embedding size
  self.embed = nn.Embedding(vocab_size, 10)
  self.linear = nn.Linear(2 * 10, vocab_size)

def forward(self, x):
  e1 = self.embed(x[0])            # embedding of the first context character
  e2 = self.embed(x[1])            # embedding of the second context character
  e = torch.cat([e1, e2], dim=1)   # concatenate along the feature dimension
  return self.linear(e)
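
To make the shapes concrete, here is a small sketch of that forward pass, assuming x is stacked as a (2, batch) tensor of character indices (the batch layout is just my assumption):

import torch
import torch.nn as nn

vocab_size = 65
embed = nn.Embedding(vocab_size, 10)
linear = nn.Linear(2 * 10, vocab_size)

x = torch.randint(0, vocab_size, (2, 32))  # 2 context characters x batch of 32

e1 = embed(x[0])                 # (32, 10)
e2 = embed(x[1])                 # (32, 10)
e = torch.cat([e1, e2], dim=1)   # (32, 20): dim=1 is the feature dimension
out = linear(e)                  # (32, vocab_size) logits for the next character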

Now my question is:
given that I’m not “chaining” layers explicitly (e.g. with nn.Sequential), and that I’m “cutting” my input (x[0], x[1]) and recombining it with torch.cat, will gradients be computed correctly?
I’ve read about autograd, but it’s unclear to me how it infers the graph and whether “everything” is accepted.
Thanks!

Yes, gradients will be computed correctly: indexing and torch.cat are both differentiable operations. Autograd builds the graph dynamically from whatever operations run in your forward pass, so explicit chaining with nn.Sequential is not required.
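
As a quick sanity check (a standalone sketch with illustrative shapes, not code from the thread): index an embedding twice, concatenate, run backward, and confirm the embedding table received a gradient.

import torch
import torch.nn as nn

embed = nn.Embedding(65, 10)
x = torch.randint(0, 65, (2, 4))  # 2 context characters x batch of 4

e = torch.cat([embed(x[0]), embed(x[1])], dim=1)  # (4, 20)
e.sum().backward()                # backprop through the cat and the two indexing ops

print(embed.weight.grad is not None)  # True: the gradient reached the embedding weights

Only the rows of embed.weight.grad whose indices actually appear in x will be non-zero; the rest stay at zero.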
