I am trying to generate audible speech from impaired speech with GANs. There is a constraint that the generated sound should stay close to the input (and keep its content), i.e. some distance between input and output should be minimized. Instead of a raw sample-level distance, I want to use higher-level features, taken from the layers of the discriminator network. I built the discriminator so that it also returns its intermediate layer activations:
    import torch.nn as nn

    class Discriminator(nn.Module):
        def __init__(self):
            super().__init__()
            ...

        def forward(self, x):
            # collect the activation of every block alongside the final output
            layer_outputs = []
            l1 = self.layer1(x)
            layer_outputs.append(l1)
            l2 = self.layer2(l1)
            layer_outputs.append(l2)
            l3 = self.layer3(l2)
            layer_outputs.append(l3)
            l4 = self.layer4(l3)
            layer_outputs.append(l4)
            l5 = self.layer5(l4)
            layer_outputs.append(l5)
            out = self.fully(l5)
            return out, layer_outputs
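The repeated append lines feel clumsy to me. A more compact version I considered keeps the blocks in an nn.ModuleList and loops over them. This is only a sketch: the convolution shapes below are placeholders I made up, not my real architecture.

    import torch.nn as nn

    class DiscriminatorSketch(nn.Module):
        """Same idea, but forward() loops over an nn.ModuleList."""

        def __init__(self):
            super().__init__()
            # Placeholder blocks; the real layer definitions are elided above.
            self.blocks = nn.ModuleList([
                nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
                nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
                nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
            ])
            self.fully = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))

        def forward(self, x):
            layer_outputs = []
            for block in self.blocks:
                x = block(x)
                layer_outputs.append(x)  # capture every intermediate activation
            return self.fully(x), layer_outputs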
And there is a function that computes a weighted sum of the L1 differences between the intermediate-layer activations of two inputs:
    import torch as tr
    import torch.nn.functional as trf

    def lap1_loss(D, x: tr.Tensor, y: tr.Tensor):
        """
        Implements the Laplacian (Lap1) loss over the discriminator layers.
        Input shape: (N, C, H, W)
        :return: the scalar loss value
        """
        assert x.shape == y.shape, "The shape of inputs must be equal."
        assert len(x.shape) == 4, "Input must be 4 dimensional."
        _, x_acts = D(x)
        _, y_acts = D(y)
        # L1 distance per layer, weighted by 2^(-2l) so deeper layers count less
        losses = tr.stack([trf.l1_loss(x_l, y_l) * 2 ** (-2 * l)
                           for l, (x_l, y_l) in enumerate(zip(x_acts, y_acts))])
        loss = tr.sum(losses)
        return loss
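For a quick smoke test I run it on random inputs, assuming the discriminator accepts that shape (the (4, 1, 64, 64) shape is just something I made up for testing, not my real spectrogram size):

    # Random spectrogram-like batches; any (N, C, H, W) the discriminator accepts works.
    D = Discriminator()
    x = tr.randn(4, 1, 64, 64)
    y = tr.randn(4, 1, 64, 64)
    loss = lap1_loss(D, x, y)
    print(loss)      # scalar tensor
    loss.backward()  # gradients flow into D and anything that produced x or y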
It seems to work, but I am not sure this is the right way to implement it. My question: is this solution correct, and is there a more elegant way to do it?
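One alternative I have seen for exactly this pattern is to leave forward() untouched and collect the activations with temporary forward hooks instead. A sketch of what I mean (the helper below is mine, not from any library):

    import torch.nn as nn

    def run_with_activations(model: nn.Module, layers, x):
        """Run model(x) and capture the output of each module in `layers`
        using temporary forward hooks, without modifying forward()."""
        acts = []
        handles = [m.register_forward_hook(
                       lambda _mod, _inp, out: acts.append(out))
                   for m in layers]
        try:
            out = model(x)
        finally:
            for h in handles:
                h.remove()  # always detach the hooks again
        return out, acts

With that, lap1_loss would not need a modified Discriminator; it could hook D.layer1 through D.layer5 directly. I would be interested in whether this is considered cleaner.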