I am trying to generate audible speech from impaired speech with GANs. There is a constraint that the generated sound should stay close to the input (and keep its content), i.e. some distance between input and output should be minimized. Instead of a raw sample-level distance, I want to use higher-level features, taken from the layers of the discriminator network. I built the discriminator so that it also returns its intermediate layer activations:
    import torch.nn as nn

    class Discriminator(nn.Module):
        def __init__(self):
            super().__init__()
            ...

        def forward(self, x):
            # collect the activation of every block alongside the final output
            layer_outputs = []
            l1 = self.layer1(x)
            layer_outputs.append(l1)
            l2 = self.layer2(l1)
            layer_outputs.append(l2)
            l3 = self.layer3(l2)
            layer_outputs.append(l3)
            l4 = self.layer4(l3)
            layer_outputs.append(l4)
            l5 = self.layer5(l4)
            layer_outputs.append(l5)
            out = self.fully(l5)
            return out, layer_outputs
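The repeated append lines feel clumsy to me. A more compact version I considered keeps the blocks in an nn.ModuleList and loops over them. This is only a sketch: the convolution shapes below are placeholders I made up, not my real architecture.

    import torch.nn as nn

    class DiscriminatorSketch(nn.Module):
        """Same idea, but forward() loops over an nn.ModuleList."""

        def __init__(self):
            super().__init__()
            # Placeholder blocks; the real layer definitions are elided above.
            self.blocks = nn.ModuleList([
                nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
                nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
                nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2)),
            ])
            self.fully = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))

        def forward(self, x):
            layer_outputs = []
            for block in self.blocks:
                x = block(x)
                layer_outputs.append(x)  # capture every intermediate activation
            return self.fully(x), layer_outputs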
And there is a function that computes a weighted sum of the L1 differences between the intermediate-layer activations of two inputs:
    import torch as tr
    import torch.nn.functional as trf

    def lap1_loss(D, x: tr.Tensor, y: tr.Tensor):
        """
        Implements the Laplacian (Lap1) loss over the discriminator layers.
        Input shape: (N, C, H, W)
        :return: the scalar loss value
        """
        assert x.shape == y.shape, "The shape of inputs must be equal."
        assert len(x.shape) == 4, "Input must be 4 dimensional."
        _, x_acts = D(x)
        _, y_acts = D(y)
        # L1 distance per layer, weighted by 2^(-2l) so deeper layers count less
        losses = tr.stack([trf.l1_loss(x_l, y_l) * 2 ** (-2 * l)
                           for l, (x_l, y_l) in enumerate(zip(x_acts, y_acts))])
        loss = tr.sum(losses)
        return loss
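For a quick smoke test I run it on random inputs, assuming the discriminator accepts that shape (the (4, 1, 64, 64) shape is just something I made up for testing, not my real spectrogram size):

    # Random spectrogram-like batches; any (N, C, H, W) the discriminator accepts works.
    D = Discriminator()
    x = tr.randn(4, 1, 64, 64)
    y = tr.randn(4, 1, 64, 64)
    loss = lap1_loss(D, x, y)
    print(loss)      # scalar tensor
    loss.backward()  # gradients flow into D and anything that produced x or y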
It seems to work, but I am not sure this is the right way to implement it. My question: is this solution correct, and is there a more elegant way to do it?
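One alternative I have seen for exactly this pattern is to leave forward() untouched and collect the activations with temporary forward hooks instead. A sketch of what I mean (the helper below is mine, not from any library):

    import torch.nn as nn

    def run_with_activations(model: nn.Module, layers, x):
        """Run model(x) and capture the output of each module in `layers`
        using temporary forward hooks, without modifying forward()."""
        acts = []
        handles = [m.register_forward_hook(
                       lambda _mod, _inp, out: acts.append(out))
                   for m in layers]
        try:
            out = model(x)
        finally:
            for h in handles:
                h.remove()  # always detach the hooks again
        return out, acts

With that, lap1_loss would not need a modified Discriminator; it could hook D.layer1 through D.layer5 directly. I would be interested in whether this is considered cleaner.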