Context: I'm learning PyTorch and trying to predict the next character given the previous character (Shakespeare input). I'm aware that there are smarter ways to predict things, but this is a good toy example to learn the ins and outs of PyTorch.
At first, there is just 1 embedding layer (number of chars as input, number of chars as output) and a loss function. Training works: the loss function decreases with epochs.
class OneCharModel(nn.Module):
    def __init__(self):
        super().__init__()
        # vocab_size = len(set(chars in shakespeare)) = 65
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x):
        return self.embedding(x)
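For context, the training loop is roughly the following (a minimal sketch, not my exact code; get_batch, num_epochs and the learning rate are placeholders):

import torch
import torch.nn as nn

model = OneCharModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(num_epochs):
    x, y = get_batch()            # x, y: (batch,) long tensors of character indices
    logits = model(x)             # (batch, vocab_size)
    loss = criterion(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()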
Now I would like to predict the next character based on the past 2 chars.
So my input is made of a tensor with 2 values. I aggregate the two embeddings into a larger tensor and pass that into a prediction layer (linear).
I do like so:
class TwoCharModel(nn.Module):
    def __init__(self):
        super().__init__()
        # the embedding dimension 10 is totally "random"
        self.embedding = nn.Embedding(vocab_size, 10)
        self.linear = nn.Linear(2 * 10, vocab_size)

    def forward(self, x):
        # x holds the two previous character indices, assumed shape (2, batch)
        e1 = self.embedding(x[0])         # (batch, 10)
        e2 = self.embedding(x[1])         # (batch, 10)
        e = torch.cat([e1, e2], dim=1)    # (batch, 2 * 10)
        return self.linear(e)
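For concreteness, I call it roughly like this (assuming the two context characters are stacked along dim 0, so each column is one sample; the index values are made up):

import torch

model = TwoCharModel()
# batch of 3 samples; row 0 = first context char, row 1 = second context char
x = torch.tensor([[ 5, 12, 47],
                  [20,  1, 33]])
logits = model(x)   # (3, vocab_size)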
Now my question is:
given that I’m not “chaining” layers properly (e.g. with nn.Sequential), and that I’m “cutting” my input (x[0], x[1]) and reuniting it with cat, will gradients be computed correctly?
I’ve read about Autograd, but it’s unclear to me how it infers the computation graph and whether “everything” (e.g. indexing and cat) is accepted.
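In case it clarifies what I mean by “computed correctly”, here is the kind of sanity check I had in mind (a minimal sketch, reusing TwoCharModel from above; the index values are made up):

import torch
import torch.nn as nn

model = TwoCharModel()
x = torch.tensor([[5, 12], [20, 1]])   # shape (2, batch=2): two context chars per sample
y = torch.tensor([7, 30])              # next-character targets

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# If autograd tracks the indexing and the cat, both gradients should be populated
print(model.embedding.weight.grad is not None)   # expect True
print(model.linear.weight.grad is not None)      # expect True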
Thanks!