I am trying to train a single embedding layer using masking.
It takes a masked sentence of 10 tokens and predicts the masked tokens.
The values are the ids of the tokens in a vocabulary.
E.g.: [223, 444, 1, 11, 53, 232, 1, 435, 12, 43]
The target is the same sequence with all non-masked tokens replaced by the empty token 0:
E.g.: [0, 0, 33, 0, 0, 0, 23, 0, 0, 0]
During loss computation, index 0 is ignored.
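To make the ignoring concrete, here is a minimal sketch of what I mean, using the `ignore_index` argument of nn.CrossEntropyLoss (the vocabulary size of 500 is a made-up number):

```python
import torch
import torch.nn as nn

# Only positions whose target id is non-zero contribute to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=0)

logits = torch.randn(1, 500, 10)  # (N, C, seq_len)
target = torch.tensor([[0, 0, 33, 0, 0, 0, 23, 0, 0, 0]])  # (N, seq_len)

loss = criterion(logits, target)  # averaged over the 2 masked positions only
```

Changing the logits at any of the positions whose target is 0 leaves the loss unchanged, which is the behavior I want.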
How should I model the classification problem? In layman's terms, for each entry I am computing
the class of each of the 10 tokens. How should this be reflected in the network?
The network is as follows:

class MaskedModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, max_length):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(max_length * embedding_dim, max_length * vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch_size, max_length, embedding_size)
        embedded = embedded.view(x.size(0), -1) # (batch_size, max_length * embedding_size)
        output = self.linear(embedded)          # (batch_size, max_length * vocab_size)
        output = output.view(?, ?, ?)
        return output
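As a sanity check of the shapes up to the final view, this is what I get with toy sizes (vocab_size=500, embedding_dim=8, max_length=10 are made-up numbers):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, max_length = 500, 8, 10

embedding = nn.Embedding(vocab_size, embedding_dim)
linear = nn.Linear(max_length * embedding_dim, max_length * vocab_size)

x = torch.randint(0, vocab_size, (4, max_length))  # (batch_size, max_length)
embedded = embedding(x)                            # (4, 10, 8)
flat = embedded.view(x.size(0), -1)                # (4, 80)
output = linear(flat)                              # (4, 5000)
```

So the question is only about how to reshape that final (batch_size, max_length * vocab_size) tensor.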
In particular, what should be the shape of the output to be suitable for a
CrossEntropyLoss computation?
(A): (N, C, 10)
or
(B): (N, 10, C)
where C is the vocab_size (i.e. n_classes) and N is the batch size.
I actually tried both.
(A) works fine.
(B) gives me an error: expected output of size [N, C], got [N, 10].
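A minimal reproduction of the two shapes, assuming toy sizes N=4, C=500:

```python
import torch
import torch.nn as nn

N, C, L = 4, 500, 10
criterion = nn.CrossEntropyLoss(ignore_index=0)
target = torch.randint(1, C, (N, L))  # (N, 10) token ids

# (A): class dimension second -- accepted
loss_a = criterion(torch.randn(N, C, L), target)

# (B): class dimension last -- raises a size-mismatch RuntimeError,
# because CrossEntropyLoss always treats dim 1 as the class dimension
try:
    criterion(torch.randn(N, L, C), target)
except RuntimeError as e:
    print("shape (N, 10, C) rejected:", e)
```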