Suppose I have a model that takes as input a multi-level mask like a segmentation map where each pixel can take one of N > 2 classes.
My question is: should the input shape be (batch_size, num_classes, height, width), or can I instead use a single channel to encode the class index, giving an input shape of (batch_size, 1, height, width)?
For the output, it's pretty clear that the shape is (batch_size, num_classes, height, width). But what about the input?
You can encode the input information as you want.
E.g. you could certainly pass the input in a one-hot encoded way, as a single channel image, or even a color image, where each color represents a certain class.
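As a rough illustration of the first two options, here is a minimal sketch that assumes PyTorch and made-up sizes (2 samples, 4 classes, 8x8 masks). The variable names are only for illustration, and the scaling of the single-channel version to [0, 1] is an optional choice, not a requirement.

```python
import torch
import torch.nn.functional as F

# Assumed example sizes, not taken from any specific model.
batch_size, num_classes, height, width = 2, 4, 8, 8

# Class-index mask: one integer label per pixel, shape (batch_size, height, width).
index_mask = torch.randint(0, num_classes, (batch_size, height, width))

# Option 1: one-hot encoding, shape (batch_size, num_classes, height, width).
# F.one_hot appends the class dimension last, so move it to the channel position.
one_hot_mask = F.one_hot(index_mask, num_classes=num_classes).permute(0, 3, 1, 2).float()

# Option 2: a single channel holding the class index,
# shape (batch_size, 1, height, width).
# Dividing by (num_classes - 1) scales the values to [0, 1]; this is optional.
single_channel_mask = index_mask.unsqueeze(1).float() / (num_classes - 1)

print(one_hot_mask.shape)
print(single_channel_mask.shape)
```

Note that the single-channel version implies an ordering between classes (class 3 looks "closer" to class 2 than to class 0), while the one-hot version treats all classes as equally distinct, which is one reason to compare both on the validation set.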
The “right” approach also depends on your current model.
I.e. are you creating a model from scratch? If so, try out different approaches and check their performance on the validation set.
On the other hand, if you are thinking about fine-tuning a model, most pretrained models expect 3 input channels, so you might have to adapt your input shape to that if you don't want to replace or manipulate the first convolution.
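A minimal sketch of both fine-tuning routes, assuming PyTorch. Here `pretrained_conv` is just a randomly initialized stand-in for the first layer of a pretrained network (its 64 filters and 7x7 kernel are assumptions), and averaging the RGB filters to initialize the new layer is one common heuristic, not the only option.

```python
import torch
import torch.nn as nn

num_classes = 4  # assumed number of mask classes

# Stand-in for the first convolution of a pretrained model (3 input channels).
pretrained_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Route A: keep the layer and repeat a single-channel mask across 3 channels.
single_channel_mask = torch.rand(2, 1, 32, 32)  # dummy input
three_channel_mask = single_channel_mask.expand(-1, 3, -1, -1)
out_a = pretrained_conv(three_channel_mask)

# Route B: replace the layer with one that accepts num_classes input channels.
new_conv = nn.Conv2d(num_classes, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Heuristic init: reuse the mean of the pretrained filters over the
    # 3 input channels for every new input channel.
    mean_weight = pretrained_conv.weight.mean(dim=1, keepdim=True)  # (64, 1, 7, 7)
    new_conv.weight.copy_(mean_weight.repeat(1, num_classes, 1, 1))

multi_channel_mask = torch.rand(2, num_classes, 32, 32)  # dummy stand-in for a one-hot mask
out_b = new_conv(multi_channel_mask)

print(out_a.shape, out_b.shape)
```

In a real model you would assign the new layer back to the corresponding attribute of the network; the attribute name depends on the architecture.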
Thanks for your great advice. I’m trying to reimplement a model from a paper from scratch by literally following whatever is reported in the paper. However, some details seemed to be missing from the paper.
Thank you once again, and stay safe.