Perceptual loss from discriminator network

terbed · March 12, 2021, 5:37pm

Hi there,

I am trying to generate audible speech from impaired speech with GANs. There is a constraint that the generated sound should resemble the input (and keep the context), i.e. some distance between input and output should be minimized. Instead of comparing the raw signals, I want to compare higher-level features taken from the layers of the discriminator network. I wrote the discriminator so that it also returns the outputs of its intermediate layers:

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        .
        .
        .

    def forward(self, x):

        layer_outputs = []

        l1 = self.layer1(x) 
        layer_outputs.append(l1)

        l2 = self.layer2(l1) 
        layer_outputs.append(l2)
        
        l3 = self.layer3(l2)  
        layer_outputs.append(l3)

        l4 = self.layer4(l3)  
        layer_outputs.append(l4)

        l5 = self.layer5(l4) 
        layer_outputs.append(l5)

        out = self.fully(l5)

        return out, layer_outputs
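
For anyone who wants to run this pattern, here is a minimal self-contained sketch. The layer types and sizes are pure assumptions (the real layers are omitted above), and `ToyDiscriminator` is a hypothetical name. It keeps the layers in an `nn.ModuleList` so the forward pass can collect the activations in a loop:

```python
import torch
import torch.nn as nn


class ToyDiscriminator(nn.Module):
    """Hypothetical stand-in: layer types and sizes are illustrative assumptions."""

    def __init__(self, in_channels: int = 1, base: int = 8, n_layers: int = 5):
        super().__init__()
        blocks = []
        ch_in = in_channels
        for i in range(n_layers):
            ch_out = base * 2 ** i
            # Each block halves the spatial size and doubles the channel count.
            blocks.append(nn.Sequential(
                nn.Conv2d(ch_in, ch_out, kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ))
            ch_in = ch_out
        self.blocks = nn.ModuleList(blocks)
        # Global pooling keeps the final linear layer independent of the input size.
        self.fully = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch_in, 1)
        )

    def forward(self, x):
        layer_outputs = []
        for block in self.blocks:
            x = block(x)
            layer_outputs.append(x)  # keep every intermediate activation
        out = self.fully(x)
        return out, layer_outputs


torch.manual_seed(0)
D = ToyDiscriminator()
out, acts = D(torch.randn(2, 1, 64, 64))
print(out.shape, [tuple(a.shape) for a in acts])
```

The return signature `(out, layer_outputs)` is the same as in the class above, so a loss function written against it works with either version.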

I also have a function that sums the weighted L1 differences between the intermediate activations the discriminator produces for two inputs:

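    # Assumed aliases, inferred from the calls used below.
    import torch as tr
    import torch.nn.functional as trf

    def lap1_loss(D, x: tr.Tensor, y: tr.Tensor) -> tr.Tensor:
        """
        Implements the Laplace loss on the discriminator layers.
        Input shape: (N, C, H, W)
        :return: the scalar loss value
        """
        assert x.shape == y.shape, "The shape of inputs must be equal."
        assert x.dim() == 4, "Input must be 4 dimensional."

        # Intermediate activations of D for both inputs (the final output is ignored).
        _, x_acts = D(x)
        _, y_acts = D(y)

        # Mean absolute difference per layer, weighted by 2^(-2l) with l starting at 0.
        losses = tr.stack([trf.l1_loss(x_l, y_l) * 2 ** (-2 * l)
                           for l, (x_l, y_l) in enumerate(zip(x_acts, y_acts))])
        return tr.sum(losses)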
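
A quick sanity check for a loss like this can be sketched as follows. `TinyD` is a hypothetical two-layer discriminator with made-up sizes, used only to exercise the function, and `generated` stands in for the generator output. The loss should be (numerically) zero for identical inputs, and gradients should flow back to the generated input:

```python
import torch as tr
import torch.nn as nn
import torch.nn.functional as trf


class TinyD(nn.Module):
    """Hypothetical two-layer discriminator, only for exercising the loss."""

    def __init__(self):
        super().__init__()
        self.layer1 = nn.Conv2d(1, 4, 3, stride=2, padding=1)
        self.layer2 = nn.Conv2d(4, 8, 3, stride=2, padding=1)
        self.fully = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1)
        )

    def forward(self, x):
        l1 = trf.leaky_relu(self.layer1(x), 0.2)
        l2 = trf.leaky_relu(self.layer2(l1), 0.2)
        return self.fully(l2), [l1, l2]


def lap1_loss(D, x, y):
    # Same computation as in the post: weighted sum of per-layer L1 distances.
    _, x_acts = D(x)
    _, y_acts = D(y)
    losses = tr.stack([trf.l1_loss(x_l, y_l) * 2 ** (-2 * l)
                       for l, (x_l, y_l) in enumerate(zip(x_acts, y_acts))])
    return tr.sum(losses)


tr.manual_seed(0)
D = TinyD()
target = tr.randn(2, 1, 16, 16)
generated = tr.randn(2, 1, 16, 16, requires_grad=True)  # stands in for G(input)

same = lap1_loss(D, target, target)     # identical inputs
loss = lap1_loss(D, generated, target)  # different inputs
loss.backward()

print(same.item(), loss.item(), generated.grad.shape)
```

Note that `backward()` here also accumulates gradients in the discriminator's parameters, so those need to be zeroed before the discriminator's own update step.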

It seems to work, but I am not sure this is the right way to implement it. Is this solution correct, and is there a more elegant one?

Thanks!