Training a single embedding using masking

I am trying to train a single embedding layer using masking.

It takes a masked sentence of 10 tokens and predicts the masked tokens.

The values are the ids of the tokens in a vocabulary, e.g.:
[223, 444, 1, 11, 53, 232, 1, 435, 12, 43]

The target is the same sequence with all non-masked tokens replaced by the empty token 0, e.g.:
[0, 0, 33, 0, 0, 0, 23, 0, 0, 0]
During loss computation index 0 is ignored.
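
For concreteness, here is a minimal sketch of one input/target pair as PyTorch tensors (treating id 1 as the mask token is just an assumption for illustration):

    import torch

    # One masked sentence of max_length = 10; id 1 is assumed to be the mask token.
    input_ids = torch.tensor([[223, 444, 1, 11, 53, 232, 1, 435, 12, 43]])  # (1, 10)
    # The target keeps the true id only at masked positions; 0 elsewhere (ignored by the loss).
    target = torch.tensor([[0, 0, 33, 0, 0, 0, 23, 0, 0, 0]])               # (1, 10)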

How should I model the classification problem? In layman's terms, for each entry I am predicting
the class of each of the 10 tokens. How should this be reflected in the network?

The network is as follows:

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(max_length * embedding_dim, max_length * vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch_size, max_length, embedding_dim)
        embedded = embedded.view(x.size(0), -1) # (batch_size, max_length * embedding_dim)
        output = self.linear(embedded)          # (batch_size, max_length * vocab_size)
        output = output.view(?, ?, ?)           # what should this shape be?
        return output

In particular, what should be the shape of the output to be suitable for an
nn.CrossEntropyLoss computation?

(A): (N, C, 10)
or
(B): (N, 10, C)

where C is the vocab_size (i.e. the number of classes)
and N is the batch size.

I actually tried both.
(A) works fine.
(B) gives me an error: expected output of size [N, C], got [N, 10].

I’m not sure I entirely understand your use case, but given the mentioned output shapes I would say (A) is the expected shape, as nn.CrossEntropyLoss generally expects a model output in the shape [batch_size, nb_classes, *additional_dimensions].
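
A minimal sketch of that shape requirement with the numbers from your post (vocab_size and batch_size are assumed here); index 0 is passed as ignore_index so the non-masked positions don't contribute to the loss:

    import torch
    import torch.nn as nn

    batch_size, max_length, vocab_size = 4, 10, 1000  # assumed sizes

    logits = torch.randn(batch_size, vocab_size, max_length)         # (N, C, 10), option (A)
    target = torch.randint(0, vocab_size, (batch_size, max_length))  # (N, 10) class indices

    criterion = nn.CrossEntropyLoss(ignore_index=0)
    loss = criterion(logits, target)

Starting from the flattened linear output, one way to arrive at this layout is output.view(batch_size, max_length, vocab_size).permute(0, 2, 1).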

To make the problem clearer:

Say, in the context of natural language processing (NLP), I have
a model that predicts the next word. My vocabulary size is 60,000,
so the model output (logits) should have 60,000 units.
In this case the cross-entropy module will pick the maximally activated
unit via softmax and then apply the entropy loss.

Now I want to extend the model to predict 10 tokens rather than only one.
The output size is (10 × 60,000): for each token there are 60,000 units from which
we extract the maximally activated one.

In the latter case, how should we reshape the output so that it works with cross entropy?
Is this a multi-dimensional loss? Or should I manually calculate softmax and argmax?

60,000 would refer to the class count in your example and should thus be in dim1. I’m unsure how the tokens are used: if they refer to independent samples, they should land in dim0 (the batch dimension); if they represent a temporal dependency, they would land in dim2, where dim0 would still represent the batch dimension and could be 1 if a single sample is used.
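
A minimal sketch with those dimensions (the batch size is assumed); note that nn.CrossEntropyLoss applies log-softmax internally, so no manual softmax or argmax is needed:

    import torch
    import torch.nn as nn

    N, num_tokens, num_classes = 2, 10, 60000  # batch size N is assumed

    logits = torch.randn(N, num_classes, num_tokens)         # classes in dim1, tokens in dim2
    target = torch.randint(0, num_classes, (N, num_tokens))  # (N, 10) class indices

    loss = nn.CrossEntropyLoss()(logits, target)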

True. The solution is as in nn.CrossEntropyLoss() for text with multiple dimension.

However, it is misleading to have the number of classes in dim1.
It makes more sense to have the output as (batch_size, dimension1, nb_classes), which will be
reduced to (batch_size, dimension1).

It is misleading because, from an application perspective, the natural output shape is (batch_size, dimension1, nb_classes), so you have to add another functional operation to bridge the difference.

For example, check this BERT training code: they added an output.transpose(1, 2) operation before computing the loss.
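
A sketch of that pattern (shapes assumed): the model emits (batch_size, seq_len, vocab_size), and the transpose moves the class dimension into dim1 before the loss:

    import torch
    import torch.nn as nn

    batch_size, seq_len, vocab_size = 4, 10, 1000  # assumed sizes

    output = torch.randn(batch_size, seq_len, vocab_size)         # (N, seq_len, C) from the model
    target = torch.randint(0, vocab_size, (batch_size, seq_len))  # (N, seq_len)

    criterion = nn.CrossEntropyLoss(ignore_index=0)
    loss = criterion(output.transpose(1, 2), target)              # transposed to (N, C, seq_len)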

I don’t think it’s generally misleading, as the needed permutations depend on the use case. E.g. segmentation models using an nn.Conv2d layer as the output define the output channels as the number of classes and would thus not need any transpose. It could of course differ for your language model, but changing the default would force transposes for other models.
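
For comparison, a minimal segmentation-style sketch (all sizes assumed): with nn.Conv2d the output channels already act as the class dimension, so the logits are (N, C, H, W) and no transpose is needed:

    import torch
    import torch.nn as nn

    N, in_channels, num_classes, H, W = 2, 3, 5, 8, 8  # assumed sizes

    conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
    logits = conv(torch.randn(N, in_channels, H, W))   # (N, num_classes, H, W)
    target = torch.randint(0, num_classes, (N, H, W))  # (N, H, W)

    loss = nn.CrossEntropyLoss()(logits, target)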