VGG Perceptual Loss for very high-definition images


Perceptual loss has become very prevalent, with an example shown in this code. However, I mostly see people using VGG16 rather than VGG19. This could be because people generally use low- to medium-resolution images, such as 400x600, so the depth of VGG16 may be sufficient. However, the output of my denoising network is a high-definition image (2048 x 4096) with a perfectly registered corresponding ground-truth image. Of course, at training time I use only a 512 x 512 patch, and the perceptual loss is computed over this patch only.

So I am not clear on the following issues regarding best practice:

  • Use VGG16 or VGG19? Since VGG19 has more layers, I naively believe it should be used, although most implementations on GitHub still use VGG16.

  • Even in VGG16, people do not use all the layers; mostly the middle layers. But since I want my restored image to be an identical replica of the ground truth, can I, in the best case, naively use the L1-norm difference on all max-pool layers of VGG? Or should I use only the deep layers and skip the shallow ones, or use only the middle layers?

  • Also, many implementations, such as this one, resize the image to 224 x 224. But since mine is a much larger image, bilinearly interpolating 512 x 512 patches down to this small resolution may be problematic. So which is better for computing perceptual loss: passing the full-resolution patch through VGG, or the downsized version?

Thank you very much.

There is not much benefit in the last layers, as those layers become more and more task-specific, while we want something that can guide our model to generalize. Also, most of the basic structure in an image is easily captured by the low-to-middle layers of the model. You can use the last layers if you want, but it is a case of diminishing returns.

It also confuses me why people use VGG16 when we have much better classification models. One reason I once found was that it just works, so nobody bothers to change it, since you would have to choose the layers from which you want features if you switched to a new architecture.


Thank you for this explanation. I now feel I should take the loss from the shallow and middle layers only.

as you would have to choose the layers from where you want features

The famous paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution has the following diagram:

[Screenshot: loss-network diagram from the paper]

According to this, relu3_3 is used for the content loss, but in the description the paper says:

For all style transfer experiments we compute feature reconstruction loss at layer relu2_2

So can you please additionally help me decide which one will be more suitable for a denoising task when a perfectly aligned ground truth is available?


I think it comes down to experimentation. It is hard to say anything in deep learning without training a model.