import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class VGG(nn.Module):
    def __init__(self, conv_index: str = '22'):
        super(VGG, self).__init__()
        vgg_features = torchvision.models.vgg19(pretrained=True).features
        modules = [m for m in vgg_features]
        if conv_index == '22':
            # slice ends at conv2_2 (index 7), i.e. pre-activation features
            self.vgg = nn.Sequential(*modules[:8])
        elif conv_index == '54':
            # slice ends at conv5_4 (index 34), i.e. pre-activation features
            self.vgg = nn.Sequential(*modules[:35])
        # ImageNet statistics expected by torchvision's pretrained VGG19
        self.register_buffer('vgg_mean', torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer('vgg_std', torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))
        # freeze the feature extractor; requires_grad has to be set per parameter
        for p in self.vgg.parameters():
            p.requires_grad = False

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        def _forward(x):
            # normalise to ImageNet statistics before the VGG forward pass
            x = (x - self.vgg_mean) / self.vgg_std
            return self.vgg(x)

        vgg_sr = _forward(sr)
        with torch.no_grad():
            vgg_hr = _forward(hr.detach())
        loss = F.mse_loss(vgg_sr, vgg_hr)
        return loss
So my questions are:
1- I think the idea is to use a layer from the middle blocks, because we don't want the output of the task-specific layers. So, is layer 22 better than 35?
2- Does it make sense to calculate the MSE at several levels, i.e. at both 22 and 35?
3- Is it better to use the VGG loss in combination with the total variation loss and the MSE? If so, what do you think the ratio should be?
Which layer to use?
It really depends on what you’re trying to achieve:
Conv2_2 (layer 8): Captures low-level features like edges, textures, colors. Great for style transfer or when you care about fine details.
Conv5_4 (layer 35): Captures high-level semantic features. Better for when you want images to “look similar” conceptually.
For most super-resolution/enhancement tasks, conv2_2 or conv3_3 tend to work better because you want to preserve textures and details. The deeper layers are too abstract.
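If you want to double-check which slice index corresponds to which convolution, you can simply enumerate torchvision's VGG19 feature extractor and print the module at each index:

import torchvision

features = torchvision.models.vgg19(pretrained=True).features
for idx, module in enumerate(features):
    print(idx, module)
# For VGG19, index 7 is conv2_2 and index 34 is conv5_4, so slicing with
# modules[:8] and modules[:35] stops right after those convolutions
# (pre-activation outputs).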
Using multiple layers?
This is actually pretty common and often works better. A minimal sketch of one way to do it follows.
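The sketch below slices VGG19 into consecutive blocks and sums the feature-space MSE at each cut point; the class name MultiLayerVGGLoss, the cut points (8 for conv2_2, 35 for conv5_4), and the weights are my own placeholders rather than tuned values.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiLayerVGGLoss(nn.Module):
    # Perceptual loss evaluated at several VGG19 depths.
    # cut_points 8 and 35 correspond to conv2_2 and conv5_4 (pre-activation).
    # (ImageNet normalisation omitted for brevity; apply it as in the class above.)
    def __init__(self, cut_points=(8, 35), weights=(1.0, 1.0)):
        super().__init__()
        modules = [m for m in torchvision.models.vgg19(pretrained=True).features]
        # build consecutive slices so each image passes through the network only once
        self.slices = nn.ModuleList()
        prev = 0
        for cut in cut_points:
            self.slices.append(nn.Sequential(*modules[prev:cut]))
            prev = cut
        self.weights = weights
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        loss = 0.0
        x, y = sr, hr.detach()
        for w, block in zip(self.weights, self.slices):
            x = block(x)
            with torch.no_grad():
                y = block(y)
            loss = loss + w * F.mse_loss(x, y)
        return loss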
Combining with other losses?
Sure. You could try something along these lines (a sketch follows the list):
MSE/L1: Keep this as your base (weight=1.0)
Perceptual: Usually much smaller (0.001-0.01) because VGG features have larger magnitudes
TV loss: Tiny weight (1e-8 to 1e-6) - a little goes a long way
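Wired together, that could look roughly like the sketch below. Here vgg_loss is assumed to be an instance of the perceptual loss class from the question, total_variation and combined_loss are hypothetical helpers, and the default weights are only starting points to tune.

import torch
import torch.nn.functional as F

def total_variation(x: torch.Tensor) -> torch.Tensor:
    # anisotropic TV: mean absolute difference between neighbouring pixels (NCHW input)
    tv_h = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    tv_w = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return tv_h + tv_w

def combined_loss(sr, hr, vgg_loss, w_pix=1.0, w_perc=0.01, w_tv=1e-6):
    pixel = F.l1_loss(sr, hr)        # or F.mse_loss(sr, hr) as the base term
    perceptual = vgg_loss(sr, hr)    # VGG feature-space MSE
    tv = total_variation(sr)         # smoothness prior, applied to the output only
    return w_pix * pixel + w_perc * perceptual + w_tv * tv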
So far, I have only used layer 22 (which, with the code above, is conv2_2). Yes, the output is pre-activation, so there is a decent variance; I haven't tested post-activation yet.
I am trying to generate images of microstructures, so not only do the images have to look good, they also have to match the ground truth data on several other criteria.
I noticed that using VGG + TV + MSE dominates the discriminator loss very rapidly and leads to mode collapse by epoch 10. Reducing the weights or removing the TV + MSE terms delays the collapse by about 7 epochs, but introduces a checkerboard effect.
That said, using the perceptual loss definitely made the images look sharper and crisper, and made the generator converge faster.