FCN8s, first time need a hand!


I'm new to CNNs and PyTorch, so apologies if my question is obvious. I am doing an MSc project in semantic segmentation and trying to write an FCN8s (VGG16) model that is true to the original paper. What I am struggling with is the idea that any input size can be used.

What I have is a kernel of 7x7 at “fc6”. Now I assume the original inputs for the model in the paper would have been reduced to 7x7 at this point? If that was the case, then fc6 would produce a 1x1x4096?

I do not know what input size they actually used (if you know, please let me know!), but my input is currently 256x256, leading to a pool5 size of 15x15. This means fc6, with its 7x7 kernel, drops it to 9x9x4096. Is this OK, or do I need to change fc6 to a 15x15 kernel in order to get down to 1x1?
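For concreteness, here is a quick sketch of the size arithmetic, assuming the FCN "pad 100" trick on conv1_1 and Caffe-style ceil rounding in the pools (which is what turns a 256x256 input into a 15x15 pool5):

```python
import math

def conv_out(n, k, p=0, s=1):
    # standard convolution output-size formula
    return (n + 2 * p - k) // s + 1

def pool_out(n, k=2, s=2):
    # Caffe-style pooling rounds up (ceil), which the original FCN inherits
    return math.ceil((n - k) / s) + 1

n = conv_out(256, 3, p=100)   # conv1_1 with the FCN "pad 100" trick -> 454
for _ in range(5):            # the five 2x2 stride-2 max pools of VGG16
    n = pool_out(n)
print(n)                      # pool5: 15
print(conv_out(n, 7))         # fc6 with its 7x7 kernel: 9
```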

So I guess the question is: can any input size be used with the original architecture, or do you have to adjust kernel sizes to ensure a 1x1x4096 output after fc6?

Thanks in advance

No, you don’t need to go down to 1x1. In fact, there is a lower bound on what sizes you can input (because every feature map needs to stay >= 1x1).
Similar for UNets.
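To make the lower bound concrete, here is a sketch for plain VGG16 *without* the FCN pad-100 trick (3x3 pad-1 convs keep the spatial size, floor rounding in the pools):

```python
def fc6_size(n):
    # plain VGG16: 3x3 pad-1 convs preserve size, five 2x2 stride-2
    # max pools halve it (floor), then fc6 is a 7x7 conv with no padding
    for _ in range(5):
        n = n // 2
    return n - 7 + 1

# smallest square input whose fc6 output is still >= 1x1
min_side = min(n for n in range(1, 400) if fc6_size(n) >= 1)
print(min_side)   # 224
```

Reassuringly, the minimum comes out as 224, the classic VGG crop size; the pad-100 trick in the original FCN code relaxes this bound considerably.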

Thanks Thomas, this was my hunch. So we end up with a feature map of at least 1x1, which we then upsample from, and the size of this feature map is related to the input image size. The question then is whether it performs best at a certain size. I have made 256x256 tiles from a 4000x4000 satellite patch; would I be better off using a different tile size?

On a side note, are 256 tiles of this size enough to train on? I have seen a lot of examples using transfer learning and saying it is important to do, but my data has 22 channels, so I assume I can’t make use of the pretrained VGG16 model, which expects just 3?

The first thing is to try… You probably want rather aggressive augmentation, and maybe see if you can get more unlabelled data.
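A minimal augmentation sketch that works for any number of channels (names and shapes are illustrative; this uses plain numpy arrays in (C, H, W) layout rather than torchvision transforms, which mostly assume 3-channel images):

```python
import numpy as np

def augment(img, mask, rng):
    # img: (C, H, W) with any channel count; mask: (H, W) label map
    k = int(rng.integers(4))            # random multiple of 90 degrees
    img = np.rot90(img, k, axes=(1, 2))
    mask = np.rot90(mask, k)
    if rng.random() < 0.5:              # random horizontal flip
        img = img[:, :, ::-1]
        mask = mask[:, ::-1]
    # copy so downstream tensor conversion sees contiguous memory
    return np.ascontiguousarray(img), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
img, mask = augment(np.zeros((22, 256, 256)), np.zeros((256, 256)), rng)
print(img.shape, mask.shape)            # (22, 256, 256) (256, 256)
```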

Regarding transfer learning: you could try to duplicate and scale down the three-channel weights in the first layer, or just add a little noise, but one of the key observations is that transfer learning works best when the pretraining inputs are related to the final task (e.g. ImageNet pretraining is not as useful for medical imaging as pretraining on medical images to start with).
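One way to sketch the duplicate-and-scale idea for a 22-channel input (the weight array here is random stand-in data with VGG16 conv1_1's shape; in PyTorch you would apply the same operation to the first conv layer's weight tensor):

```python
import numpy as np

# stand-in for pretrained first-layer weights: (out_ch=64, in_ch=3, kH=3, kW=3)
w_rgb = np.random.randn(64, 3, 3, 3).astype(np.float32)

in_ch = 22
reps = -(-in_ch // 3)                               # ceil(22 / 3) = 8
w_new = np.tile(w_rgb, (1, reps, 1, 1))[:, :in_ch]  # repeat RGB weights, trim to 22
w_new = w_new * (3.0 / in_ch)                       # scale down so activations keep a similar magnitude
print(w_new.shape)                                  # (64, 22, 3, 3)
```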

Regarding the input sizes: performance will degrade if you dramatically change the resolution, but you can get by with changing the size.