I am trying to reproduce PSPNet using PyTorch and this is my first time creating a semantic segmentation model. I understand that for an image classification model, we have an RGB input of shape [h,w,3] and a label (ground truth) that is either a single class index or a one-hot vector of shape [n_classes]. We then feed the input through the model to get an output and compute the loss. For example: output = model(input); loss = criterion(output, label).
However, in semantic segmentation (I am using the ADE20K dataset), we have input = [h,w,3] and label = [h,w,3], and we then encode the label to [h,w,1]. ADE20K has a total of 150 classes, so our model will output [h,w,150]. I am confused about how we can then compute the loss, since the dimensions of the label and the output are clearly different.
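To make the shape mismatch concrete, here is a minimal sketch of what I am trying (toy spatial size and a batch dimension added; the shapes follow PyTorch's CrossEntropyLoss convention of [N, C, H, W] for the prediction and [N, H, W] with integer class indices for the target):

```python
import torch
import torch.nn as nn

n_classes = 150  # ADE20K
h, w = 64, 64    # toy spatial size for illustration

# Model output: one score per class per pixel -> [N, n_classes, H, W]
output = torch.randn(1, n_classes, h, w)

# Encoded label: one class index per pixel -> [N, H, W]
label = torch.randint(0, n_classes, (1, h, w))

# CrossEntropyLoss accepts these mismatched-looking shapes directly:
# it treats dim 1 of `output` as the class dimension and compares it
# per pixel against the integer indices in `label`.
criterion = nn.CrossEntropyLoss()
loss = criterion(output, label)
print(loss.item())  # a single scalar loss value
```

Is this the intended way to compute the loss, i.e. the criterion itself handles the per-pixel comparison between the [N, n_classes, H, W] prediction and the [N, H, W] index map?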
Any help or guidance on this will be greatly appreciated!