Multi-class classification beginner questions

I have some basic questions about multi-class classification that keep tripping me up. I will outline the problem below while also providing a sample code snippet.

I have created a dataset in which each example is a fixed-length sequence of 800 items with 768 features per item. If it helps, you may think of this as something like word embeddings for a fixed-length input sequence (800 tokens) with 768 features each. The number of examples is 100. Each example has a single label assigned to it from a total of 18 classes.

The model consists of 12 transformer encoder blocks and a single linear layer on top. I use this model to perform classification. Here is the code snippet:

import torch
from torch import nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from typing import Dict

class MyDataset(Dataset):
    def __init__(self) -> None:

        self.train_data_list = [torch.rand(800, 768) for ii in range(100)]
        self.train_labels_list = torch.randint(18, (100,))

    def __len__(self):
        return len(self.train_data_list)

    def __getitem__(self, idx):
        example = self.train_data_list[idx]
        label = self.train_labels_list[idx]
        return {'example': example, 'label': label}

class EncoderClassifierModel(nn.Module):
    def __init__(self):
        # nn.Module must be initialized before submodules can be registered
        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
        self.encoders = TransformerEncoder(encoder_layer, num_layers=12)
        self.classifier = nn.Linear(768, 18)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, example:torch.Tensor=None, label:torch.Tensor=None) -> Dict[str, torch.Tensor]:
        encoder_out = self.encoders(example)
        logits = self.classifier(encoder_out)
        avg_logits = torch.mean(logits, 1)
        loss = self.loss_fn(avg_logits, label)
        return {'avg_logits': avg_logits, 'loss': loss}

#my_device = 'cuda:0'
my_device = 'cpu'
train_dataset = MyDataset()
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

my_model = EncoderClassifierModel().to(my_device)
my_optimizer = AdamW(my_model.parameters())

for epoch in range(10):
    for batch in train_dataloader:
        batch = {k: v.to(my_device) for k, v in batch.items()}
        outputs = my_model(**batch)
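        loss = outputs['loss']
        # clear old gradients, backpropagate, and update the weights
        my_optimizer.zero_grad()
        loss.backward()
        my_optimizer.step()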

My questions are as follows:

  1. Should there be some random initialization of the weights? I haven’t found this to be consistently the case, but I know it was mentioned in many of my courses. I also see it in some of the PyTorch examples.

  2. Is logits the right shape (4,800,18) for this problem?

  3. As the output of the linear layer has shape (4,800,18), it would seem to me that I need to aggregate over the sequence length in order to obtain some sort of prediction by the model. This would result in avg_logits.shape = (4,18). Since I’m using CrossEntropyLoss, I pass avg_logits and label to it and get the loss out. Is this the correct procedure?

  4. Importantly, it would seem to me that I should be doing something like an argmax on avg_logits to get the prediction as an integer and pass that and label to CrossEntropyLoss, but I read in the documentation that CrossEntropyLoss takes unnormalized logits as input. It seems like the model will have a hard time trying to get close to an integer prediction when avg_logits is a tensor of floats, but maybe I misunderstand CrossEntropyLoss?

If I can provide any other information please let me know. Thank you in advance for your help!

  1. Parameters are randomly initialized during the model init and use the method defined in the corresponding reset_parameters method of the module. You could of course use a custom initialization, if needed.

  2. If passed directly to nn.CrossEntropyLoss, a logits tensor of shape (4,800,18) would be interpreted as [batch_size=4, nb_classes=800, seq_len=18], which sounds wrong based on your explanation, so you might want to permute the tensor.

  3. You could treat the prediction as a sequence (assuming your target also contains labels for each time step) or “reduce” it somehow as you’ve already described. Using the mean might work, but I’m also seeing users indexing the “last step” of the sequence for the final prediction, so you might want to experiment with a few different approaches.

  4. No, you should not pass the predictions created by argmax to the criterion, as the argmax operation is not differentiable. nn.CrossEntropyLoss expects raw logits and will use F.log_softmax internally. You can still apply argmax to create the predictions to calculate the accuracy, but don’t use it for the loss calculation.
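
Points 1 and 4 above could be sketched roughly as follows. This is only a minimal illustration, not the code from the question: the Xavier initialization is an arbitrary choice, `init_weights` is a hypothetical helper, and `features` is a made-up stand-in for an already pooled encoder output.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Point 1: the layer is already randomly initialized when it is constructed.
classifier = nn.Linear(768, 18)

# Optional custom initialization (Xavier is an arbitrary choice for illustration).
def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

classifier.apply(init_weights)

# Point 4: the criterion takes raw float logits and integer class indices.
features = torch.randn(4, 768)        # hypothetical pooled features, one vector per example
logits = classifier(features)         # shape (4, 18), unnormalized scores
labels = torch.randint(0, 18, (4,))   # shape (4,), dtype int64

loss = nn.CrossEntropyLoss()(logits, labels)

# Same value computed by hand: log_softmax followed by the negative log-likelihood.
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# argmax is used only for metrics such as accuracy, never for the loss.
preds = logits.argmax(dim=1)
accuracy = (preds == labels).float().mean()
```

The loss stays attached to the autograd graph, while `preds` is an integer tensor with no gradient, which is why it cannot be used to train the model.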

@ptrblck Thank you for your reply. You’ve cleared things up for me on points 1, 3, and 4. If you don’t mind I’d like to elaborate a bit more on point 2, as I may have some confusion there.

As I understood, the logits coming out of self.classifier would be [batch_size=4, seq_len=800, nb_classes=18] since the tensor coming out of that layer has shape (4, 800, 18). I assumed that is what the documentation was indicating for torch.nn.Linear(in_features, out_features), but maybe I have misunderstood.

Given that I have batch_size=4 and each example has a length of 800 with 768 features, how would one go about making a classification prediction over 18 different classes using a series of transformer encoder blocks and a classifier head on top?

Thank you for your assistance and I look forward to your reply.

You are right that the linear layer will output a shape of [batch_size, *additional_dims, out_features]. nn.CrossEntropyLoss expects logits in the shape [batch_size, nb_classes, *additional_dims] so you would need to permute the output into the desired shape as seen here:

lin = nn.Linear(768, 18)
x = torch.randn(4, 800, 768)
out = lin(x)
# torch.Size([4, 800, 18])

out = out.permute(0, 2, 1).contiguous()

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 18, (4, 800))

loss = criterion(out, target)
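
One way to convince yourself of what the permute does in the per-token case: with the default mean reduction, it should give the same loss as flattening the batch and sequence dimensions together. A small sketch under that assumption (the names `loss_permuted` and `loss_flat` are made up for this comparison):

```python
import torch
from torch import nn

torch.manual_seed(0)

lin = nn.Linear(768, 18)
x = torch.randn(4, 800, 768)
out = lin(x)                                   # (4, 800, 18)
target = torch.randint(0, 18, (4, 800))        # one label per time step
criterion = nn.CrossEntropyLoss()

# Class dimension moved to dim 1, as the criterion expects: (4, 18, 800)
loss_permuted = criterion(out.permute(0, 2, 1).contiguous(), target)

# Treat every time step as its own sample: (3200, 18) logits vs. (3200,) targets
loss_flat = criterion(out.reshape(-1, 18), target.reshape(-1))
```

Both variants average the per-token losses over all 4 * 800 positions, so the two values should agree up to floating-point error.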

@ptrblck I reread through the documentation and I see where my misunderstanding comes from. Thank you for your help.

With respect to the target variable, the entire sequence has a label instead of each item in the sequence having a label which results in an error of:

RuntimeError: Expected target size [4, 800], got [4]

This was partially why I opted to average over the logits, but I’m not sure that is the best approach. How would you better construct the targets and the code given that each sequence has a single label as opposed to each member in a sequence having an individual label?

I don’t know which approach would work best for you, but you could experiment with e.g. indexing the last time step and using it for the loss calculation, or reducing the sequence dimension via mean, sum, etc.

@ptrblck Ok I see. Given how the labels are set up in my case I would need to reduce the logits tensor somehow (via sum, mean, etc.) or utilize the last time step (i.e. logits[:-1:-1]). In other words, the updated code snippet would look like:

lin = nn.Linear(768, 18)
x = torch.randn(4, 800, 768)
out = lin(x)
# torch.Size([4, 800, 18])

out = out.permute(0, 2, 1).contiguous()

#out = torch.mean(out, 2)
#out = out[:-1:-1]

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 18, (4, ))

loss = criterion(out, target)

Have I understood correctly?

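Yes, you understood the issue correctly, but since out has the shape [batch_size, nb_classes, seq_len] after the permute, you would need to index the last time step along the last dimension via: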

out = out[..., -1]
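
Putting the thread together, a minimal sketch of the sequence-level case (one label per sequence) might look like the following. Here `x` is a random stand-in for the encoder output, and the names `mean_logits`, `last_logits`, and `same` are made up for illustration. Which reduction works better is something to test on your own data.

```python
import torch
from torch import nn

torch.manual_seed(0)

lin = nn.Linear(768, 18)
x = torch.randn(4, 800, 768)        # stand-in for the encoder output
out = lin(x)                        # (4, 800, 18): [batch, seq_len, classes]

# Option A: average the logits over the sequence dimension
mean_logits = out.mean(dim=1)       # (4, 18)

# Option B: keep only the last time step
last_logits = out[:, -1, :]         # (4, 18)

# After permuting to [batch, classes, seq_len], out[..., -1] selects the same slice
out_perm = out.permute(0, 2, 1)
same = torch.equal(out_perm[..., -1], last_logits)

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 18, (4,))  # one label per sequence

loss_mean = criterion(mean_logits, target)
loss_last = criterion(last_logits, target)
```

Either way the criterion receives logits of shape (4, 18) and targets of shape (4,), so the "Expected target size [4, 800]" error no longer occurs.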

@ptrblck Thank you very much for your help!