Hello community,
I am trying to implement Fully Convolutional Networks (FCN) for semantic segmentation on the Pascal VOC 2012 dataset. I am having trouble loading the dataset. My questions are as follows:

In FCN-32 the output has dimensions H x W x num_classes. Does this mean I have to convert my ground-truth segmentation maps to H x W x num_classes? If so, how can I generate one-hot encoded ground-truth images?

Which loss function is used for this kind of task apart from IoU? Can categorical cross-entropy be used here?

1. No. Load the ground truth as H * W and let your network output H * W * num_classes.
2. Cross-entropy loss is fine; this is a dense pixel-level classification problem.
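
To illustrate why one-hot targets are not required, here is a minimal NumPy sketch of pixel-wise cross-entropy that takes integer labels of shape H * W directly. The function names and the ignore_index argument are made up for this example; 255 is assumed to mark the void/border pixels in the VOC label images. Most frameworks ship an equivalent "sparse" cross-entropy, so this is only meant to show the idea, not a specific library API.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels, ignore_index=255):
    """Mean cross-entropy over pixels.

    logits: float array, shape (H, W, num_classes), raw network scores.
    labels: int array, shape (H, W), one class index per pixel.
    ignore_index: label value to skip (assumed 255 for VOC void pixels).
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    valid = labels != ignore_index
    # Replace ignored labels with 0 so indexing stays in range;
    # those pixels are masked out of the sum below.
    safe_labels = np.where(valid, labels, 0)
    # Pick the log-probability of the true class at every pixel.
    picked = np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    return -(picked * valid).sum() / max(valid.sum(), 1)

def one_hot(labels, num_classes):
    """Only needed if a framework insists on (H, W, num_classes) targets.

    Assumes every label is in [0, num_classes); remap void pixels first.
    """
    return np.eye(num_classes, dtype=np.float32)[labels]
```

Indexing the log-probabilities with the integer label gives the same value as multiplying by a one-hot target and summing, which is why the H * W ground truth is enough.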

@lxtGH Thanks for the advice, it worked. I can now run the entire network, but I am getting a very high loss (~3.0 to 1.8). Also, since the output has shape H * W * num_classes, how should I plot it to visualize my predictions?

First collapse the H * W * num_classes output into an H * W map with argmax. By the definition of semantic segmentation, each pixel belongs to one class, so you can give each pixel an RGB color, with one color per class.
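
A rough sketch of that argmax-then-color step in NumPy follows. The colorize function and its random palette are made up for illustration (this is not the official VOC colormap), and class 0 is assumed to be background.

```python
import numpy as np

def colorize(pred, num_classes=21, seed=0):
    """Turn a (H, W, num_classes) prediction into a class map and an RGB image."""
    # Most likely class per pixel, shape (H, W).
    class_map = pred.argmax(axis=-1)

    # Hypothetical palette: one random RGB color per class.
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(num_classes, 3)).astype(np.uint8)
    palette[0] = 0  # assumption: class 0 is background, drawn in black

    # Fancy indexing maps every class index to its color, shape (H, W, 3).
    rgb = palette[class_map]
    return class_map, rgb
```

The resulting rgb array can be passed to any image viewer, for example matplotlib's plt.imshow(rgb).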

@lxtGH I used res.argmax(-1), where res is my 21-channel prediction. The output generated from res.argmax(-1) is 0, and I get the output in the following way:

How many iterations have you trained the model for? I was once in a similar situation, but after enough epochs the model started to output segmentations.
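
One quick way to check whether the model is still at that early stage is to count how often each class is predicted, and to compare the loss against the chance level. With 21 classes, a uniform prediction gives a cross-entropy of ln(21), roughly 3.04, so a starting loss near 3.0 is what an untrained network would produce. The sketch below assumes class 0 is background; class_histogram is a helper name made up for this example.

```python
import numpy as np

def class_histogram(class_map, num_classes=21):
    """Count how many pixels were assigned to each class index."""
    return np.bincount(class_map.ravel(), minlength=num_classes)

# Cross-entropy of a uniform guess over 21 classes (chance level).
chance_loss = np.log(21)

# Example: an argmax map that is 0 at every pixel.
class_map = np.zeros((4, 4), dtype=np.int64)
counts = class_histogram(class_map)
only_background = counts[0] == class_map.size
```

If only_background is true, the network is predicting background everywhere, which is common early in training because background pixels dominate the images.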