Fully understand FCN

Hello everyone,
I’m trying to figure out how a fully convolutional network (FCN) for image segmentation works.
I found the following image


I understand the structure up to the last blue rectangle (7x7x4096), because this is just a normal CNN structure.
After that we apply a 1x1 convolution to reduce the number of feature maps, right? To be precise, according to the image, the number of filters is reduced to the number of classes?
After that, upsampling brings the output back to the same resolution as the input image.
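
To make the shapes concrete, here is a minimal NumPy sketch of those two steps. It is only an illustration under assumptions: K = 21 is an arbitrary class count, the weights are random rather than trained, and nearest-neighbour repetition stands in for the learned upsampling that an FCN would normally use.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 21  # assumed number of classes (not taken from the image)
features = rng.standard_normal((7, 7, 4096))  # stand-in for the 7x7x4096 feature map

# A 1x1 convolution applies the same linear map at every spatial position:
# one weight column and one bias per class.
weights = rng.standard_normal((4096, K)) * 0.01  # random, untrained weights
bias = np.zeros(K)
scores = features @ weights + bias  # shape (7, 7, K)

# Nearest-neighbour upsampling by a factor of 32, used here only as a
# simple stand-in for learned upsampling (e.g. transposed convolution).
upsampled = scores.repeat(32, axis=0).repeat(32, axis=1)  # shape (224, 224, K)

print(scores.shape, upsampled.shape)  # (7, 7, 21) (224, 224, 21)
```

So each of the K output channels is a score map for one class, first at 7x7 and then at the input resolution.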

Two questions now:

  1. I don’t understand why we are using K filters, where K is the number of classes. How is the 224x224xK output interpreted? How do we get the colored/segmented output image like in this picture?

  2. What does the training data look like? What is the ground truth and what is the loss function? How do we calculate the loss between the segmented image and the ground truth?

Thanks for helping.

This video explains segmentation in NNs intuitively: https://www.youtube.com/watch?v=NzY5IJodjek&t=1357s . Your first question is answered at the end. The loss is computed by flattening the output array and summing a classification loss over every point in the array, or by using Dice loss, which is a ratio of areas (the overlap between the predicted and ground-truth regions relative to their combined size).
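
Both options can be sketched in a few lines of NumPy. This is a rough illustration, not a reference implementation: it assumes the ground truth is an integer mask with one class index per pixel, it averages the per-pixel loss instead of summing it, and the function names and the soft Dice formulation are my own choices.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last (class) axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def pixelwise_cross_entropy(probs, labels):
    """Mean cross-entropy over all pixels.

    probs:  (H, W, K) class probabilities per pixel
    labels: (H, W) integer class index per pixel (assumed ground-truth format)
    """
    k = probs.shape[-1]
    flat_probs = probs.reshape(-1, k)   # flatten the spatial dimensions
    flat_labels = labels.reshape(-1)
    # Probability the network assigned to the correct class of each pixel
    picked = flat_probs[np.arange(flat_labels.size), flat_labels]
    return float(-np.log(picked + 1e-12).mean())

def dice_loss(probs, labels, eps=1e-6):
    """1 minus the soft Dice coefficient, averaged over classes."""
    k = probs.shape[-1]
    one_hot = np.eye(k)[labels]                           # (H, W, K)
    intersection = (probs * one_hot).sum(axis=(0, 1))     # overlap per class
    totals = probs.sum(axis=(0, 1)) + one_hot.sum(axis=(0, 1))
    dice = (2 * intersection + eps) / (totals + eps)
    return float(1 - dice.mean())

# Toy example with random scores and a random mask
rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4, 3))
labels = rng.integers(0, 3, size=(4, 4))
probs = softmax(scores)
print(pixelwise_cross_entropy(probs, labels), dice_loss(probs, labels))
```

With a perfect prediction both losses go to (approximately) zero, which is a quick sanity check when implementing them.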

Thanks for your answer.
I watched the video. Just to make sure I got it right: according to my image above, at the end I have a 240x240xK image, where for every pixel there are K probabilities. The highest probability gives the class of the pixel, right?
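
That reading can be written down as a short NumPy sketch. The probabilities here are random and the four-colour palette is arbitrary, so treat both as assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4  # assumed number of classes
probs = rng.random((240, 240, K))
probs /= probs.sum(axis=-1, keepdims=True)  # K probabilities per pixel

# The predicted class of each pixel is the index of its highest probability.
class_map = probs.argmax(axis=-1)  # shape (240, 240), values in 0..K-1

# Arbitrary palette: one RGB colour per class.
palette = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8
)
color_image = palette[class_map]  # shape (240, 240, 3)

print(class_map.shape, color_image.shape)  # (240, 240) (240, 240, 3)
```

Indexing the palette with the class map is what produces the coloured segmentation picture: every pixel gets the colour of its predicted class.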

With regard to your second answer, that would mean we just use some kind of cross-entropy, e.g. the ground truth should be [ 0 0 1 0 ] (class 3) but the FCN predicts [ 0.2, 0.8, 0.4, 0.1 ], right?
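
For a single pixel, that cross-entropy can be worked through as follows. One caveat: the predicted values from the post sum to 1.5, so this sketch rescales them to sum to 1 first, which is an assumption made only for illustration (a softmax output would already sum to 1).

```python
import math

target = [0, 0, 1, 0]               # one-hot ground truth (class 3)
predicted = [0.2, 0.8, 0.4, 0.1]    # values from the post; they sum to 1.5

# Rescale so the values form a probability distribution (assumption).
total = sum(predicted)
probs = [p / total for p in predicted]

# Cross-entropy with a one-hot target reduces to -log(probability of the true class).
loss = -sum(t * math.log(p) for t, p in zip(target, probs))

print(round(loss, 2))  # 1.32
```

Only the probability assigned to the true class enters the loss, so the loss shrinks toward zero as that probability approaches 1.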