I am trying to train a single embedding layer using masking.

It takes a masked sentence of 10 tokens and predicts the masked tokens.

The values are the ids of the tokens in a vocabulary.

E.g.: [223, 444, 1, 11, 53, 232, 1, 435, 12, 43]

The target is the same sequence, with all non-masked tokens replaced by the empty token 0:

E.g.: [0, 0, 33, 0, 0, 0, 23, 0, 0, 0]

During loss computation, index 0 is ignored.
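A minimal sketch of this setup in PyTorch (the concrete ids are just the example values above; `ignore_index=0` is how the empty token gets skipped):

```python
import torch

# Masked input: token id 1 marks a masked position (example ids from above).
x = torch.tensor([[223, 444, 1, 11, 53, 232, 1, 435, 12, 43]])

# Target: the original id at each masked position, empty token 0 elsewhere.
y = torch.tensor([[0, 0, 33, 0, 0, 0, 23, 0, 0, 0]])

# ignore_index=0 makes the loss skip every non-masked position.
criterion = torch.nn.CrossEntropyLoss(ignore_index=0)
```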

How should I model the classification problem? In layman's terms, for each entry I am predicting

the class of each of the 10 tokens. How should this be reflected in the network?

The network is as follow:

```
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.linear = nn.Linear(max_length * embedding_dim, max_length * vocab_size)

def forward(self, x):
    embedded = self.embedding(x)             # (batch_size, max_length, embedding_dim)
    embedded = embedded.view(x.size(0), -1)  # (batch_size, max_length * embedding_dim)
    output = self.linear(embedded)           # (batch_size, max_length * vocab_size)
    output = output.view(?, ?, ?)
    return output
```

In particular, what should the shape of the output be to be suitable for a

CrossEntropyLoss computation?

(A): (N, C, 10)

or

(B): (N, 10, C)

Where C is the vocab_size (the number of classes)

and N is the batch size.

I actually tried both.

(A) is working fine.

(B) gives me an error: `expected output of size [N, C], got [N, 10]`.
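For reference, a minimal sanity check of the two layouts against `nn.CrossEntropyLoss` (the sizes `N`, `C` below are illustrative, not from my actual model; labels are drawn from 1..C-1 so none hit `ignore_index`):

```python
import torch

N, C, L = 4, 1000, 10                 # batch, n_classes (vocab), sequence length (illustrative)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0)
target = torch.randint(1, C, (N, L))  # (N, 10); labels >= 1, so none are ignored

# Layout (A): classes on dim 1 -- (N, C, 10). This is what CrossEntropyLoss expects.
logits_a = torch.randn(N, C, L)
loss_a = criterion(logits_a, target)

# Layout (B): (N, 10, C). The loss then treats dim 1 (size 10) as the class dim
# and expects a target of size (N, C), hence the reported error.
logits_b = torch.randn(N, L, C)
try:
    criterion(logits_b, target)
except RuntimeError as e:
    print("layout (B) fails:", e)

# If the network naturally produces (N, 10, C), permute before the loss:
loss_b = criterion(logits_b.permute(0, 2, 1), target)
```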