I have some basic questions about multi-class classification that keep tripping me up. I will outline the problem below while also providing a sample code snippet.
I have created a dataset with a fixed input size of 800 and a feature size of 768. If it helps, you may think of this as something like word embeddings for fixed-length input sequences of size 800, each with 768 features. There are 100 examples, and each example is assigned a label from a total of 18 classes.
The model consists of 12 transformer encoder blocks and a single linear layer on top. I use this model to perform classification. Here is the code snippet:
```python
import torch
from torch import nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from typing import Dict


class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()
        # 100 examples, each a (800, 768) tensor with an integer label in [0, 18)
        self.train_data_list = [torch.rand(800, 768) for ii in range(100)]
        self.train_labels_list = torch.randint(18, (100,))

    def __len__(self):
        return len(self.train_data_list)

    def __getitem__(self, idx):
        example = self.train_data_list[idx]
        label = self.train_labels_list[idx]
        return {'example': example, 'label': label}


class EncoderClassifierModel(nn.Module):
    def __init__(self):
        super().__init__()
        encoder_layer = TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
        self.encoders = TransformerEncoder(encoder_layer, num_layers=12)
        self.classifier = nn.Linear(768, 18)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, example: torch.Tensor = None, label: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        encoder_out = self.encoders(example)   # (batch, 800, 768)
        logits = self.classifier(encoder_out)  # (batch, 800, 18)
        avg_logits = torch.mean(logits, 1)     # (batch, 18)
        loss = self.loss_fn(avg_logits, label)
        return {'avg_logits': avg_logits, 'loss': loss}


# my_device = 'cuda:0'
my_device = 'cpu'

train_dataset = MyDataset()
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# The model must be moved to the same device as the batches.
my_model = EncoderClassifierModel().to(my_device)
my_optimizer = AdamW(my_model.parameters())

for epoch in range(10):
    my_model.train()
    for batch in train_dataloader:
        batch = {k: v.to(my_device) for k, v in batch.items()}
        my_optimizer.zero_grad()
        outputs = my_model(**batch)
        loss = outputs['loss']
        loss.backward()
        my_optimizer.step()
```
My questions are as follows:
- Should there be some random initialization of the weights? I haven't found this to be consistently the case, but I know it was mentioned in many of my courses, and I also see it in some of the PyTorch examples.
- Is `logits` the right shape, `(4, 800, 18)`, for this problem?
- Since the output of the linear layer has shape `(4, 800, 18)`, it seems to me that I need to aggregate over the sequence length in order to obtain some sort of per-example prediction. This would result in `avg_logits.shape == (4, 18)`. Since I'm using `CrossEntropyLoss`, I pass `avg_logits` and `label` to it and get the loss out. Is this the correct procedure?
- Importantly, it would seem to me that I should be doing something like an argmax on `avg_logits` to get the prediction as an integer and pass that and `label` to `CrossEntropyLoss`, but I read in the documentation that `CrossEntropyLoss` takes unnormalized logits as input. It seems like the model will have a hard time trying to get close to an integer prediction when `avg_logits` is a tensor of floats, but maybe I misunderstand `CrossEntropyLoss`?
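To make the shapes in my last two questions concrete, here is a minimal standalone sketch of what I think is happening (the tensors are random placeholders standing in for the model's outputs, not my actual model):

```python
import torch
from torch import nn

# Stand-in shapes matching my setup: batch of 4, sequence length 800, 18 classes.
batch_size, seq_len, num_classes = 4, 800, 18

logits = torch.randn(batch_size, seq_len, num_classes)  # per-token linear-layer output
avg_logits = logits.mean(dim=1)                         # aggregate over sequence -> (4, 18)
labels = torch.randint(num_classes, (batch_size,))      # one integer class id per example

# CrossEntropyLoss is given the raw (unnormalized) logits and integer labels...
loss = nn.CrossEntropyLoss()(avg_logits, labels)

# ...while argmax would only be used afterwards, e.g. to compute accuracy.
preds = avg_logits.argmax(dim=-1)                       # (4,)
```

Is this the intended division of labor between the loss and the argmax?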
If I can provide any other information please let me know. Thank you in advance for your help!