I’ve been working on an image classification network, and all the examples I find show, as a final step, flattening the output of the final convolution layer into one or more fully connected linear layers. I’m hoping someone can elaborate on why this is useful, and whether there are other methods.
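For reference, this is the pattern I keep seeing (a minimal sketch in PyTorch; the layer sizes here are my own placeholder choices, not from any particular example):

```python
import torch
import torch.nn as nn

# Typical "flatten then fully connected" classifier head.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((8, 8)),  # fix the spatial size so Flatten's input is known
    nn.Flatten(),                  # (N, 16, 8, 8) -> (N, 16*8*8) = that "long line of numbers"
    nn.Linear(16 * 8 * 8, 5),      # map to 5 class scores
)

x = torch.randn(2, 3, 64, 64)      # batch of 2 dummy RGB images
print(model(x).shape)              # torch.Size([2, 5])
```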
For example, I’m currently attempting to classify images into 5 different classes. So I am creating a 5-channel target image with the same dimensions as my input, where the channel for the target class is a white square and all other channels are black squares.
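To make the setup concrete, here is roughly how I build such a target (a sketch; `make_target` and its arguments are just names I picked for illustration):

```python
import torch

def make_target(class_idx, height, width, num_classes=5):
    """5-channel target: the true class's channel is all ones (white),
    every other channel is all zeros (black)."""
    target = torch.zeros(num_classes, height, width)
    target[class_idx] = 1.0
    return target

t = make_target(2, 32, 32)
print(t.mean(dim=(1, 2)))  # tensor([0., 0., 1., 0., 0.])
```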
My thought process is that by overlaying my model’s predictions on the target image, I could see which feature details the model associates with a particular class.
Once I have that final 5-channel output, it’s trivial to take the mean of each channel and use argmax to determine a final class to present. But I’m wondering if there is any disadvantage during training if I only calculate the error against the 5-channel target images.
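The readout step I mean looks like this (a sketch; `pred` here is random placeholder data standing in for my model’s 5-channel output, with one channel boosted so the example has a clear winner):

```python
import torch

pred = torch.rand(5, 32, 32)           # stand-in for the model's 5-channel output
pred[3] += 1.0                         # make channel 3 the "brightest" for the demo

channel_means = pred.mean(dim=(1, 2))  # one mean per channel -> shape (5,)
predicted_class = channel_means.argmax().item()
print(predicted_class)                 # 3
```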
Hopefully this makes some sense. If anyone could point me to an explanation of why it is so common to flatten the final convolution layers, I would really appreciate it. Flattening all those feature maps into a long line of numbers just doesn’t make logical sense to me, and I hope someone can straighten me out.