import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class VGG(nn.Module):
    def __init__(self, conv_index: str = '22'):
        super(VGG, self).__init__()
        vgg_features = torchvision.models.vgg19(pretrained=True).features
        modules = [m for m in vgg_features]
        if conv_index == '22':
            # slice ends at conv2_2 (index 7), i.e. pre-activation features
            self.vgg = nn.Sequential(*modules[:8])
        elif conv_index == '54':
            # slice ends at conv5_4 (index 34), i.e. pre-activation features
            self.vgg = nn.Sequential(*modules[:35])
        # ImageNet statistics expected by torchvision's pretrained VGG19
        self.register_buffer('vgg_mean', torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer('vgg_std', torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))
        # freeze the feature extractor; requires_grad has to be set per parameter
        for p in self.vgg.parameters():
            p.requires_grad = False

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        def _forward(x):
            # normalise to ImageNet statistics before the VGG forward pass
            x = (x - self.vgg_mean) / self.vgg_std
            return self.vgg(x)

        vgg_sr = _forward(sr)
        with torch.no_grad():
            vgg_hr = _forward(hr.detach())
        loss = F.mse_loss(vgg_sr, vgg_hr)
        return loss
So my questions are:
1- I think the idea is to use a layer from the middle blocks, because we don't want the output of the task-specific layers. So, is layer 22 better than 35?
2- Does it make sense to calculate the MSE at several levels, i.e. at both 22 and 35?
3- Is it better to use the VGG loss in combination with the total variation loss and the MSE? If so, what do you think the ratio should be?
Which layer to use?
It really depends on what you’re trying to achieve:
Conv2_2 (layer 8): Captures low-level features like edges, textures, colors. Great for style transfer or when you care about fine details.
Conv5_4 (layer 35): Captures high-level semantic features. Better for when you want images to “look similar” conceptually.
For most super-resolution/enhancement tasks, conv2_2 or conv3_3 tend to work better because you want to preserve textures and details. The deeper layers are too abstract.
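If you want to double-check which slice index corresponds to which convolution, you can simply enumerate torchvision's VGG19 feature extractor and print the module at each index:

import torchvision

features = torchvision.models.vgg19(pretrained=True).features
for idx, module in enumerate(features):
    print(idx, module)
# For VGG19, index 7 is conv2_2 and index 34 is conv5_4, so slicing with
# modules[:8] and modules[:35] stops right after those convolutions
# (pre-activation outputs).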
Using multiple layers?
This is actually pretty common and often works better. A minimal sketch of one way to do it follows.
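The sketch below slices VGG19 into consecutive blocks and sums the feature-space MSE at each cut point; the class name MultiLayerVGGLoss, the cut points (8 for conv2_2, 35 for conv5_4), and the weights are my own placeholders rather than tuned values.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiLayerVGGLoss(nn.Module):
    # Perceptual loss evaluated at several VGG19 depths.
    # cut_points 8 and 35 correspond to conv2_2 and conv5_4 (pre-activation).
    # (ImageNet normalisation omitted for brevity; apply it as in the class above.)
    def __init__(self, cut_points=(8, 35), weights=(1.0, 1.0)):
        super().__init__()
        modules = [m for m in torchvision.models.vgg19(pretrained=True).features]
        # build consecutive slices so each image passes through the network only once
        self.slices = nn.ModuleList()
        prev = 0
        for cut in cut_points:
            self.slices.append(nn.Sequential(*modules[prev:cut]))
            prev = cut
        self.weights = weights
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        loss = 0.0
        x, y = sr, hr.detach()
        for w, block in zip(self.weights, self.slices):
            x = block(x)
            with torch.no_grad():
                y = block(y)
            loss = loss + w * F.mse_loss(x, y)
        return loss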
Combining with other losses?
Sure. You could try something along these lines (a sketch follows the list):
MSE/L1: Keep this as your base (weight=1.0)
Perceptual: Usually much smaller (0.001-0.01) because VGG features have larger magnitudes
TV loss: Tiny weight (1e-8 to 1e-6) - a little goes a long way
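Wired together, that could look roughly like the sketch below. Here vgg_loss is assumed to be an instance of the perceptual loss class from the question, total_variation and combined_loss are hypothetical helpers, and the default weights are only starting points to tune.

import torch
import torch.nn.functional as F

def total_variation(x: torch.Tensor) -> torch.Tensor:
    # anisotropic TV: mean absolute difference between neighbouring pixels (NCHW input)
    tv_h = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    tv_w = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return tv_h + tv_w

def combined_loss(sr, hr, vgg_loss, w_pix=1.0, w_perc=0.01, w_tv=1e-6):
    pixel = F.l1_loss(sr, hr)        # or F.mse_loss(sr, hr) as the base term
    perceptual = vgg_loss(sr, hr)    # VGG feature-space MSE
    tv = total_variation(sr)         # smoothness prior, applied to the output only
    return w_pix * pixel + w_perc * perceptual + w_tv * tv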
So far, I have only used layer 22 (which, with the code above, is conv2_2). Yes, the output is pre-activation, so there is a decent variance; I haven't tested post-activation yet.
I am trying to generate images of microstructures, so not only do the images have to look good, they also have to match the ground truth data on several other criteria.
I noticed that using VGG + TV + MSE dominates the discriminator loss very rapidly and leads to mode collapse by epoch 10. Reducing the weights or removing the TV + MSE terms delays the collapse by about 7 epochs, but introduces a checkerboard effect.
That said, using the perceptual loss definitely made the images look sharper and crisper, and made the generator converge faster.