A question about Conditional GAN

Hello, I’m trying to implement a simple conditional GAN and wondering if what I’ve done is correct.

As far as I understand, a conditional GAN is a simple architectural modification of the base GAN in which we concatenate a suitable vector of target properties, or labels, to the inputs of the generator and the discriminator (so we end up performing a sort of semi-supervised training).

Currently, my model, consisting of a generator and a discriminator, looks like this:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from utils import data_generator, get_data_loaders
import pandas as pd

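class Discriminator(nn.Module):
    def __init__(self, in_features):
        super().__init__()  # required before registering submodules
        # in_features must equal x.shape[1] + y.shape[1], because forward() concatenates them
        self.disc = nn.Sequential(
            nn.Linear(in_features, 128),
            nn.Linear(128, 1),
        )

    def forward(self, x, y):
        # Condition on y by appending it to the sample x along the feature dimension
        input_ = torch.cat([x, y], dim=1)
        return self.disc(input_)


class Generator(nn.Module):
    def __init__(self, z_dim, comp_dim):
        super().__init__()  # required before registering submodules
        # z_dim must equal the noise size plus y.shape[1], because forward() concatenates them
        self.gen = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.Linear(256, comp_dim),
        )

    def forward(self, x, y):
        # x is the noise vector, y the conditioning vector
        input_ = torch.cat([x, y], dim=1)
        return self.gen(input_)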

Do you think that this looks correct?

I’m asking because I’ve seen people construct an nn.Embedding() from the vector they are conditioning on, as in this tutorial I’m following. I don’t really understand why this is the case. In my situation, for example, I have a vector of target properties that I wouldn’t convert into an embedding.

If your x is just a 1D vector that can be processed by Linear, simply concatenating it with the label y is completely fine. But in the video you refer to, the input is an image with height x width dimensions, so you cannot concatenate a 1D label to it directly (some people do tile the 1D label to height x width, though I haven’t seen that much). Instead, they use a learnable embedding to convert the label into a height x width tensor, which can then be concatenated with the input image x.
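A minimal sketch of the image case described above, assuming 10 classes and 1 x 28 x 28 images (both arbitrary choices for illustration, as are the names LabelToChannel and tile_label, which do not come from the tutorial). It shows the learnable-embedding route next to the tiling alternative; in both, the label becomes an extra channel that is concatenated with the image.

```python
import torch
import torch.nn as nn


class LabelToChannel(nn.Module):
    """Hypothetical helper: maps an integer class label to a 1 x H x W map."""

    def __init__(self, num_classes, height, width):
        super().__init__()
        self.height, self.width = height, width
        # One learnable vector of length H*W per class
        self.embed = nn.Embedding(num_classes, height * width)

    def forward(self, images, labels):
        # labels: (N,) int64 -> (N, H*W) -> (N, 1, H, W)
        label_map = self.embed(labels).view(-1, 1, self.height, self.width)
        # Stack the label map onto the image as an additional channel
        return torch.cat([images, label_map], dim=1)


def tile_label(images, labels):
    """Non-learned alternative: repeat the label value over an extra channel."""
    n, _, h, w = images.shape
    label_map = labels.float().view(n, 1, 1, 1).expand(n, 1, h, w)
    return torch.cat([images, label_map], dim=1)


images = torch.randn(4, 1, 28, 28)       # assumed image batch
labels = torch.randint(0, 10, (4,))      # assumed integer class labels
with_embedding = LabelToChannel(10, 28, 28)(images, labels)
with_tiling = tile_label(images, labels)
print(with_embedding.shape, with_tiling.shape)
```

Both results have shape (4, 2, 28, 28): the original channel plus one label channel. The only difference is whether the contents of that extra channel are learned or fixed.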

This really is a case-by-case problem. Sometimes people find that directly concatenating the 1D label is not optimal, and that one-hot encoding it or learning an embedding for it works better, so they do that. It is not a matter of right or wrong, just of what has been found empirically to work better.
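To make the two non-learned options concrete, here is a small sketch for a categorical label, assuming 5 classes and 16-dimensional inputs (both made up for the example). It only applies when the label is a class index; a vector of continuous target properties, as in the question, can be concatenated as it is.

```python
import torch
import torch.nn.functional as F

num_classes = 5                       # assumed number of classes
x = torch.randn(3, 16)                # assumed batch of 3 feature vectors
labels = torch.tensor([0, 2, 4])      # integer class labels

# Option 1: append the raw label as a single extra feature
raw = torch.cat([x, labels.float().unsqueeze(1)], dim=1)

# Option 2: append a one-hot encoding, one extra feature per class
one_hot = F.one_hot(labels, num_classes=num_classes).float()
encoded = torch.cat([x, one_hot], dim=1)

print(raw.shape, encoded.shape)
```

The raw version adds one column (shape (3, 17)) and implies an ordering between classes, while the one-hot version adds num_classes columns (shape (3, 21)) and treats the classes as unordered.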

Many thanks for your answer. What I still don’t fully understand is why, in the case of images, people use a learnable embedding if we already know what the label will be for our inputs. If I have understood correctly what a conditional GAN does, it concatenates our fake/real inputs with the known targets/labels, so I don’t quite get why this is something that gets learned through an embedding.

Yes, there isn’t a strict answer. IMO, if the image features are very large (say 512 x 128 x 128) while the label is just a single number, the network could have trouble making use of the label during optimization. If you instead learn an embedding that maps the label to, say, a 128-dim vector, it might be easier for the network to learn. But again, there is no theory here, just experience.
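A rough sketch of that idea, under simplifying assumptions: the features are already flattened to 512 dimensions rather than 512 x 128 x 128, there are 10 classes, and the class name EmbeddingConditionedDiscriminator is invented for the example. The label value itself is still known and fixed; what is learned is only the 128-dim representation the network sees in place of a single number.

```python
import torch
import torch.nn as nn


class EmbeddingConditionedDiscriminator(nn.Module):
    """Hypothetical discriminator that conditions on a learned label embedding."""

    def __init__(self, feature_dim, num_classes, embed_dim=128):
        super().__init__()
        # Lookup table: one trainable embed_dim vector per class
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(feature_dim + embed_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, features, labels):
        cond = self.embed(labels)  # (N,) int64 -> (N, embed_dim)
        return self.net(torch.cat([features, cond], dim=1))


disc = EmbeddingConditionedDiscriminator(feature_dim=512, num_classes=10)
features = torch.randn(8, 512)           # assumed flattened features
labels = torch.randint(0, 10, (8,))      # assumed integer class labels
scores = disc(features, labels)
print(scores.shape)
```

Here the label contributes 128 of the 640 input features instead of 1 of 513, which is the imbalance the reply above is pointing at. Whether that actually helps is, as stated, an empirical question.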
