Multi-Class Ground Truth Annotation and Classification Head for Three Elements with Three Classes


I’m currently working on a problem that involves annotating the classification ground truth for three elements, where each element can belong to one of three classes. An example of this type of annotation is: [0, 1, 2], [2, 0, 0], or [0, 2, 2]. This form of annotation is reminiscent of semantic segmentation, as it classifies each element within the array.

I’m seeking guidance on two main aspects of my project:

Classification Head: To tackle this problem, I’m unsure how best to design the classification head for my neural network. Specifically, should I use fully-connected (FC) layers or conv2d layers, given the nature of my task? Should I opt for one over the other, or perhaps a combination of both?

Loss Function: I’d also like advice on selecting an appropriate loss function for this task. Given the multi-class nature of my problem, I’d like to know which loss function is best suited for training my model.

Additional Information:

  • I’m using PyTorch for this project, and my dataset includes annotations in the format of [0, 1, 2] for each element.
  • The model architecture I’m using incorporates both CNN and fully-connected layers, but I’m open to recommendations on how to structure the classification head.

Expected Outcome: I’m looking for guidance on the best approach to designing the classification head and selecting the appropriate loss function for my problem. Any insights, recommendations, or best practices from the community would be greatly appreciated.

Thank you for your help and advice.

I don’t fully understand the use case, since your first description sounds as if these lists contain class indices for 3 elements, while your later post describes the annotation format per sample.

I assume each batch contains 3 elements, each belonging to one of 3 possible classes?
If so, I also assume your target has the shape [batch_size, 3] containing values in [0, 2]?
In this case you could use nn.CrossEntropyLoss and output logits in the shape [batch_size, nb_classes=3, 3].
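A minimal sketch of this setup (the batch size here is a made-up placeholder) — `nn.CrossEntropyLoss` accepts logits of shape `[batch_size, nb_classes, d1]` with targets of shape `[batch_size, d1]`:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration
batch_size, nb_classes, nb_elements = 4, 3, 3

criterion = nn.CrossEntropyLoss()

# Raw logits in the shape [batch_size, nb_classes, nb_elements]
output = torch.randn(batch_size, nb_classes, nb_elements)
# Targets in the shape [batch_size, nb_elements] with class indices in [0, 2]
target = torch.randint(0, nb_classes, (batch_size, nb_elements))

loss = criterion(output, target)  # scalar loss averaged over all elements
```

Note that the model should output raw logits here, not probabilities or one-hot vectors, since `nn.CrossEntropyLoss` applies log-softmax internally over the class dimension (dim 1).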

Thank you very much for the answer. Your understanding is exactly what I have in mind.
So I’m setting up the data shapes like this:

'output' shape: (batch_size, 3, 3)
'target' shape: (batch_size, 3)

the ‘output’ is one-hot encoded for each element and looks like

[[0, 1, 0],
 [1, 0, 0],
 [0, 0, 1]]

And the ground truth looks like

[1, 0, 2]

Using nn.CrossEntropyLoss the code runs without error.
However, during training, the accuracy is quite low, ~0.51 for both train and val.
I use the code below to compute the accuracy:

predicted_labels = torch.argmax(output, dim=2)
correct_predictions = (predicted_labels == gt)
position_accuracy = correct_predictions.float().mean(dim=0)
overall_accuracy = position_accuracy.mean()

Is there any problem with the inference code?

The output looks wrong, since it should contain logits, while yours is one-hot encoded for some reason.
Also, the calculation of predicted_labels is wrong, as the argmax should be applied on dim=1, which is the class dimension.
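With logits of shape [batch_size, nb_classes, 3], the corrected accuracy computation might look like this (a sketch; the batch size and the random tensors are placeholders standing in for real model outputs and targets):

```python
import torch

batch_size, nb_classes, nb_elements = 4, 3, 3
output = torch.randn(batch_size, nb_classes, nb_elements)  # raw logits
gt = torch.randint(0, nb_classes, (batch_size, nb_elements))

# argmax over the class dimension (dim=1), not the element dimension
predicted_labels = torch.argmax(output, dim=1)  # -> [batch_size, 3]
correct_predictions = (predicted_labels == gt)
position_accuracy = correct_predictions.float().mean(dim=0)  # per-element accuracy
overall_accuracy = position_accuracy.mean()
```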

Could you provide me with an example of how the data should be and how to work with them?

I’ve mentioned the expected output shapes, including their meaning, in my first post. Only the last dimension with a size of 3 is “undefined”, as it seems to be specific to your use case.
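To make the expected tensor layout concrete, here is a minimal end-to-end sketch. The feature size and the simple linear head are made-up placeholders (in practice the features would come from your CNN backbone), but the shapes match what was described above:

```python
import torch
import torch.nn as nn

batch_size, in_features, nb_classes, nb_elements = 4, 16, 3, 3

# A simple fully-connected head emitting one logit per (class, element) pair
head = nn.Linear(in_features, nb_classes * nb_elements)
criterion = nn.CrossEntropyLoss()

features = torch.randn(batch_size, in_features)  # stand-in for CNN features
logits = head(features).view(batch_size, nb_classes, nb_elements)

target = torch.randint(0, nb_classes, (batch_size, nb_elements))
loss = criterion(logits, target)
loss.backward()  # gradients flow into the head

pred = logits.argmax(dim=1)  # [batch_size, nb_elements] class indices
```

The key design choice is emitting `nb_classes * nb_elements` raw logits and reshaping them so the class dimension sits at dim 1, which is what `nn.CrossEntropyLoss` expects for multi-dimensional targets.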