Initialization for fully-conv FCN models: why so crucial?

Hello,
I am training a fully convolutional model for segmentation, following the FCN paper, with a VGG16 encoder. I know it is usually helpful to initialize any deep architecture with pretrained weights, but can anyone provide an intuition on why it is so crucial to have pretrained weights for the encoder of an FCN? As the paper mentions, training an FCN from scratch is very hard. I am just wondering what is special about the FCN architecture that makes it so difficult to train from scratch (vs. a regular classification CNN, which can be trained from scratch reasonably easily)?
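
For context, here is roughly the setup I mean (a minimal PyTorch-style sketch; it skips the converted fc6/fc7 layers and the skip connections of the full FCN, so treat it as an illustration rather than the paper's exact architecture):

```python
import torch.nn as nn
from torchvision import models

class MiniFCN(nn.Module):
    """Illustrative FCN-32s-style model: VGG16 encoder + 1x1 score layer + upsampling."""
    def __init__(self, num_classes: int, pretrained_encoder: bool = True):
        super().__init__()
        # pretrained_encoder=True copies the ImageNet weights into the encoder;
        # False leaves it randomly initialized, i.e. training from scratch.
        # (Newer torchvision versions use the weights= argument instead of pretrained=.)
        vgg = models.vgg16(pretrained=pretrained_encoder)
        self.encoder = vgg.features  # conv1_1 ... pool5, output stride 32
        self.score = nn.Conv2d(512, num_classes, kernel_size=1)
        # Learned upsampling back to the input resolution
        # (the paper initializes such layers as bilinear interpolation).
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32,
                                           padding=16, bias=False)

    def forward(self, x):
        x = self.encoder(x)
        x = self.score(x)
        return self.upsample(x)
```

The question is essentially why the `pretrained_encoder=False` case is so much harder to get off the ground.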

Thanks for your thoughts!

Well, in the end it depends on the task you want to perform. For example, to train a state-of-the-art deep neural network on this dataset http://www.vision.caltech.edu/visipedia/CUB-200.html, the difference between using an ImageNet-pretrained model and training from scratch can be 20 percentage points of accuracy or more.

This means that, not only for the FCN but for any challenging task, using models pre-trained on good, general datasets such as ImageNet improves performance. This happens because the model has already learned to extract general features for a general task (image classification, object detection, language classification…), and you then adapt it to your problem by fine-tuning. It is similar in spirit to what Hinton did with stacked restricted Boltzmann machines.
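
In code, the usual fine-tuning recipe looks roughly like this (a hedged PyTorch sketch; the ResNet backbone, learning rates, and the 200 classes of CUB-200 are just placeholders for whatever you actually use):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights, then adapt to the target task
# (CUB-200 has 200 bird classes). Only the classifier head is replaced;
# the backbone keeps the general features learned on ImageNet.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 200)

# Common choice: a smaller learning rate for the pretrained backbone than
# for the freshly initialized head.
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc")]
optimizer = torch.optim.SGD(
    [
        {"params": backbone_params, "lr": 1e-3},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    lr=1e-3,
    momentum=0.9,
)
```

The point is that the pretrained backbone only needs a gentle nudge, while the new head is learned from scratch on the target data.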


Thank you @jmaronas for your response. I get that transferring pretrained weights generally improves performance. However, in my experience training FCNs, I have had little luck getting any learning at all from scratch. The paper also emphasizes that you do “need” pretrained weights. Even the original GitHub repository by the authors of the FCN paper states:

Why are all the outputs/gradients/parameters zero? : This is almost universally due to not initializing the weights as needed. To reproduce our FCN training, or train your own FCNs, it is crucial to transplant the weights from the corresponding ILSVRC net such as VGG16.

This makes me think it's probably not just me, and that in some cases one might not get any results at all without pretrained weights for FCNs. My question is: what makes fully convolutional networks so dependent on pretrained weights?
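
For anyone hitting the same wall, a quick way to see the symptom the FAQ describes is to log per-parameter gradient norms right after the backward pass (a small diagnostic sketch, not tied to any particular FCN implementation):

```python
import torch

def log_grad_norms(model: torch.nn.Module) -> None:
    """Print per-parameter gradient norms; call right after loss.backward().

    In the failure mode the FCN FAQ describes, these norms sit at ~0 for the
    whole encoder from the very first iterations.
    """
    for name, param in model.named_parameters():
        grad = param.grad
        norm = grad.norm().item() if grad is not None else float("nan")
        print(f"{name:50s} grad norm = {norm:.3e}")
```

If every encoder gradient is effectively zero from the start, no amount of further training will recover, which seems to be exactly the situation I end up in without pretrained weights.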

Well, I think it is not actually the fully convolutional architecture itself, but the complexity of the task you are performing.

You could try, instead of an FCN, a normal convolutional network, and check whether the invariance provided by the max-pooling operators lets you learn something.