CGAN image+labels to linear layer returns NaN

I am building a discriminator for a conditional GAN which consists of 2 components:

  • self.main_module, which reduces an image from 256x256 to 64x64 (this was previously the only module, used as a PatchGAN)

  • self.head, which takes the flattened output of the module above, concatenates it with one-hot labels, and passes it through linear layers to produce a single-value output.

The problem is that after a certain number of steps, self.head returns NaN values while none of the inputs contain NaN values.

Around 400 classes are used in the one-hot labels. Could this be too sparse? How can this be fixed?

Full code:

import torch
import torch.nn as nn


class Discriminator(nn.Module):
    def __init__(self, in_channels=3, n_classes=0):
        super().__init__()

        self.main_module = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=3, stride=1, padding=3//2),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=2, padding=3//2),  # 256x256 -> 128x128
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(128, affine=True),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=2, padding=3//2),  # 128x128 -> 64x64
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(256, affine=True),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(256, affine=True),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Conv2d(in_channels=256, out_channels=1, kernel_size=3, stride=1, padding=3//2),
            # Out is 1x64x64
        )

        self.head = nn.Sequential(
            nn.Linear(64*64 + n_classes, 1024),  # flattened 1x64x64 map + one-hot labels
            nn.BatchNorm1d(1024, affine=True),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Linear(1024, 256),
            nn.BatchNorm1d(256, affine=True),
            nn.LeakyReLU(0.2, inplace=True),

            nn.Linear(256, 1),
        )

    def forward(self, images, labels):
        x = self.main_module(images)
        x = torch.cat([x.view(x.shape[0], -1), labels], dim=1)
        x = self.head(x)
        return x
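
For context, a minimal smoke test with dummy inputs, using the same batch size of 32 and ~400 classes mentioned above, looks like this (a sketch, not part of the training code):

D = Discriminator(in_channels=3, n_classes=400)
images = torch.randn(32, 3, 256, 256)
labels = torch.nn.functional.one_hot(torch.randint(0, 400, (32,)), num_classes=400).float()
out = D(images, labels)
print(out.shape)  # torch.Size([32, 1]) -- one raw logit per sample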

So I’ve tested your code, and there may be an error with your nn.BatchNorm1d: I get a ValueError. I believe you should use nn.BatchNorm1d(1, affine=True), since you only have one channel after your linear layer.

Thanks for testing my code. I do not get any ValueErrors.

My original implementation only had linear layers (no batchnorm or ReLU) and it still failed.

Changing it to nn.BatchNorm1d(1, affine=True) did not work; it throws the error:
RuntimeError: running_mean should contain 64 elements not 1
So I do believe my usage was correct initially.
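
As a sanity check of the shapes (a minimal sketch mirroring the first block of self.head, with the ~400 classes from above):

lin = nn.Linear(64*64 + 400, 1024)
bn = nn.BatchNorm1d(1024)           # num_features matches the Linear output size
x = torch.randn(32, 64*64 + 400)    # (N, C) inputs are accepted by BatchNorm1d
print(bn(lin(x)).shape)             # torch.Size([32, 1024]), no shape error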

I will try training without BatchNorm, with only LeakyReLU and Linear layers. Update: it did not help.

By the way, I am using BCEWithLogitsLoss, so I am running the discriminator without any final activation function. That might have a hand in the issue.
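
In other words, the loss side looks roughly like this (a sketch with dummy logits):

criterion = nn.BCEWithLogitsLoss()           # applies the sigmoid internally
logits = torch.randn(32, 1)                  # raw discriminator output, no final activation
loss = criterion(logits, torch.ones_like(logits))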

What are your image and label input shapes?
Are they (N, 3, 256, 256) and (N, num_classes)?

Yes. So generally it is (32,3,256,256) and (32,~400).

It should be noted again that it works fine for about 150-200 batches, then it collapses.

Use try: ... except: ... to capture the specific batch of data that is going wild. Also, make sure to keep your BatchNorm off; it really is not supposed to give you such an error.
Update: the input to an nn.BatchNorm1d should be (N, C, L) and not (N, L), which is your case.

Good idea. I tried using .isnan().any() prior to posting the thread, since an error is never actually thrown, and nothing anomalous was found in the input data.
But maybe I also need to look at the gradients… difficult problem :confused:
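
If I do check the gradients, I'd try something along these lines after loss_D.backward() (a sketch; D is the discriminator above):

for name, p in D.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite gradient in {name}")
# torch.autograd.set_detect_anomaly(True) can also help locate the op that first produced a NaN.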

the input to an nn.BatchNorm1d should be (N, C, L) and not (N, L), which is your case.

That’s true. Do you know if this is the case for ReLU as well? Preferably I wouldn’t need either of them (norm and ReLU); I only added them to try to mitigate this very problem.

LeakyReLU/ReLU doesn’t need to know anything about the shape of the input, so it’s fine ;).
Keep at least the LeakyReLU in place, otherwise your linear layers stacked together simply become a basic perceptron (i.e. a single linear layer).
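
To illustrate (a small sketch): two Linear layers with no activation in between compose into one Linear layer whose weight is the product of the two.

l1, l2 = nn.Linear(8, 4), nn.Linear(4, 2)
x = torch.randn(5, 8)
W = l2.weight @ l1.weight                 # combined weight
b = l2.weight @ l1.bias + l2.bias         # combined bias
print(torch.allclose(l2(l1(x)), x @ W.t() + b, atol=1e-6))  # True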

What does your training process look like?

It is pretty long, so I don’t want to write it all here. But it worked very robustly before the labels were added as an additional input.

It’s the standard loop of training the discriminator and then the generator, and it can be reduced to:

G.eval()
D.train()

output = G(inputs, labels)

target_pred = D(target, labels)
output_pred = D(output, labels)

loss_D_target = BCE_loss(target_pred, torch.ones_like(target_pred))
loss_D_output = BCE_loss(output_pred, torch.zeros_like(output_pred))
loss_D = (loss_D_target + loss_D_output).mean()
loss_D.backward()

G.train()
D.eval()
optimizer_G.zero_grad()

output = G(inputs, labels)
output_pred = D(output, labels)

loss_G_VGG = VGG_loss(output, target)
loss_G_L1 = L1_loss(output, target)
loss_G_adv = BCE_loss(output_pred, torch.ones_like(output_pred))
loss_G = (loss_G_VGG + loss_G_L1 + loss_G_adv).mean()
loss_G.backward()
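
(The optimizer bookkeeping is elided above; the full loop follows the usual pattern, roughly this sketch, not the exact code:)

optimizer_D.zero_grad()
fake = G(inputs, labels).detach()            # detached so the D update does not backprop into G
real_pred, fake_pred = D(target, labels), D(fake, labels)
loss_D = BCE_loss(real_pred, torch.ones_like(real_pred)) + \
         BCE_loss(fake_pred, torch.zeros_like(fake_pred))
loss_D.backward()
optimizer_D.step()

optimizer_G.zero_grad()
# ... generator losses as above ...
loss_G.backward()
optimizer_G.step()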

I changed forward() to this to debug:

    def forward(self, images, labels):
        x = self.main_module(images)
        x = x.view(x.shape[0], -1)
        x = torch.cat([x, labels], dim=1)
        y = self.head(x)
        print(x[0])
        return y

And tensors generally look good:

tensor([-0.0892, -0.0673, -0.1182, ..., 0.0000, 0.0000, 0.0000], device='cuda:0', grad_fn=<SelectBackward>)

But then the first part of x (i.e. the convolved images) turns into inf values:

tensor([inf, inf, inf, ..., 0., 0., 0.], device='cuda:0', grad_fn=<SelectBackward>)

Those values generally hovered around 0.1 otherwise, but then they spiral within a single batch.
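
To narrow down which layer blows up first, I’m thinking of forward hooks along these lines (a sketch; it flags the first module in self.main_module whose output is non-finite):

def check_finite(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output after {name} ({type(module).__name__})")
    return hook

for name, module in D.main_module.named_children():
    module.register_forward_hook(check_finite(name))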

I’m no backpropagation expert, but intuitively I’d say that the optimizer cannot backpropagate the gradients through the labels :confused:

The problem was not in the Discriminator but rather in an unstable module of the generator (which previously worked fine and appeared stable, but in reality was not).

The architecture described above seems to be working fine.
