Convolutional VAE not training

I have been trying to implement a convolutional VAE in PyTorch for a while now and somehow cannot get my network to train correctly. Here’s the encoder, decoder, and training loop. I am training the model on the MNIST dataset.

Encoder output: Two tensors (loc, logvar) of shape [batch_size, latent_dims]

Decoder output: Image of shape [batch_size, 1, 28, 28]

Problem: Loss remains almost constant.

Encoder
from collections import OrderedDict

import torch
import torch.nn as nn

class LocLogvar(nn.Module):
    def __init__(self, in_features, latent_dims):
        super(LocLogvar, self).__init__()
        self.loc = nn.Linear(in_features, latent_dims)
        self.logvar = nn.Linear(in_features, latent_dims)

    def forward(self, inputs):
        loc = self.loc(inputs)
        logvar = self.logvar(inputs)
        return loc, logvar

encoder = nn.Sequential(OrderedDict([
        ('e_conv_layer_1', nn.Conv2d(1, 16, 5, 1)),                # 16 x 24 x 24
        ('e_relu_layer_1', IPLReLU()),                             # IPLReLU: custom in-place LeakyReLU (swapped for nn.LeakyReLU(inplace=True) in the modified code below)
        ('e_batch_norm_1', nn.BatchNorm2d(16)),
        ('e_conv_layer_2', nn.Conv2d(16, 32, 5, 1)),               # 32 x 20 x 20
        ('e_relu_layer_2', IPLReLU()),
        ('e_batch_norm_2', nn.BatchNorm2d(32)),
        ('e_conv_layer_3', nn.Conv2d(32, 32, 11, 1)),              # 32 x 10 x 10
        ('e_relu_layer_3', IPLReLU()),
        ('e_batch_norm_3', nn.BatchNorm2d(32)),
        ('e_conv_layer_4', nn.Conv2d(32, 64, 5, 1)),               # 64 x 6 x 6
        ('e_relu_layer_4', IPLReLU()),
        ('e_batch_norm_4', nn.BatchNorm2d(64)),
        ('e_dropout_layer_1', nn.Dropout2d(p=0.75)),
        ('e_conv_layer_5', nn.Conv2d(64, 128, 5, 1)),              # 128 x 2 x 2
        ('e_relu_layer_5', IPLReLU()),
        ('e_batch_norm_5', nn.BatchNorm2d(128)),
        ('e_dropout_layer_2', nn.Dropout2d(p=0.85)),               # renamed: a duplicate OrderedDict key would silently drop the first dropout layer
        ('e_flatten_layer', nn.Flatten()),
        ('out_layer', LocLogvar(128*2*2, latent_dims))
]))
Decoder
class Reshape(nn.Module):
    def __init__(self, *shape):
        super(Reshape, self).__init__()
        self.shape = shape

    def forward(self, X):
        return X.view(-1, *self.shape)

decoder = nn.Sequential(OrderedDict([
        ('inv_linear_layer_1', nn.Linear(latent_dims, 128*2*2)),   # 128 * 2 * 2
        ('inv_relu_layer_5', IPLReLU()),
        ('inv_flatten_layer', Reshape(128, 2, 2)),                 # 128 x 2 x 2
        ('inv_conv_layer_5', nn.ConvTranspose2d(128, 64, 5, 1)),   # 64 x 6 x 6
        ('inv_batch_norm_5', nn.BatchNorm2d(64)),
        ('inv_relu_layer_4', IPLReLU()),
        ('inv_conv_layer_4', nn.ConvTranspose2d(64, 32, 5, 1)),    # 32 x 10 x 10
        ('inv_batch_norm_4', nn.BatchNorm2d(32)),
        ('inv_relu_layer_3', IPLReLU()),
        ('inv_conv_layer_3', nn.ConvTranspose2d(32, 32, 11, 1)),   # 32 x 20 x 20
        ('inv_batch_norm_3', nn.BatchNorm2d(32)),
        ('inv_relu_layer_2', IPLReLU()),
        ('inv_conv_layer_2', nn.ConvTranspose2d(32, 16, 5, 1)),    # 16 x 24 x 24
        ('inv_batch_norm_2', nn.BatchNorm2d(16)),
        ('inv_relu_layer_1', IPLReLU()),
        ('inv_conv_layer_1', nn.ConvTranspose2d(16, 1, 5, 1)),     # 1 x 28 x 28
        ('inv_batch_norm_1', nn.BatchNorm2d(1)),
        ('out_layer', nn.Sigmoid())
]))
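
As a quick sanity check, the commented shapes can be verified with a dummy forward pass (a sketch, assuming latent_dims is set and IPLReLU is defined):

x = torch.randn(4, 1, 28, 28)        # dummy MNIST-shaped batch
loc, logvar = encoder(x)
print(loc.shape, logvar.shape)       # torch.Size([4, latent_dims]) each
z = torch.randn(4, latent_dims)      # dummy latent codes
print(decoder(z).shape)              # torch.Size([4, 1, 28, 28])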
Loss function
def loss_fn(loc, logvar, reconstructed, img):
    reconstructed = reconstructed.view(-1, 784)
    img = img.view(-1, 784)
    # Bernoulli reconstruction term: binary cross-entropy summed over batch and pixels
    recon_loss = -torch.sum(
        img*torch.log(reconstructed+1e-18) + (1-img)*torch.log(1-reconstructed+1e-18)
    )
    # KL divergence between N(loc, exp(logvar)) and the standard normal prior
    kl_loss = 0.5 * torch.sum(
        -logvar - 1 + logvar.exp() + loc**2
    )
    return recon_loss + kl_loss
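
For numerical stability, the hand-written cross-entropy can also be expressed with PyTorch's built-in F.binary_cross_entropy, which clamps the log terms internally, so no manual epsilon is needed. A minimal equivalent sketch:

import torch.nn.functional as F

def loss_fn(loc, logvar, reconstructed, img):
    # Summed BCE reconstruction term; the built-in clamps log values internally
    recon_loss = F.binary_cross_entropy(
        reconstructed.view(-1, 784), img.view(-1, 784), reduction='sum'
    )
    # Same closed-form KL term as above
    kl_loss = 0.5 * torch.sum(-logvar - 1 + logvar.exp() + loc**2)
    return recon_loss + kl_loss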
Training Loop
from sys import stdout

encoder = encoder.to(torch.device('cuda'))
decoder = decoder.to(torch.device('cuda'))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for i in range(epochs):
    for idx in range(x_train.shape[0] // batch_size):
        # Get the batch to train
        x_batch = x_train[idx*batch_size:(idx+1)*batch_size, ...]

        # Forward pass through encoder
        loc, logvar = encoder(x_batch)

        # Reparameterize
        epsilon = torch.randn_like(loc)
        z = loc + torch.exp(logvar * 0.5) * epsilon

        # Forward pass through decoder
        reconstructed = decoder(z)
        loss = loss_fn(loc, logvar, reconstructed, x_batch)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print the results
        stdout.write(f'\r epoch : {i}\t'
                     f'step : {min((idx+1)*batch_size, x_train.shape[0])}/{x_train.shape[0]}\t'
                     f'loss : {loss.item():.3f}\t')
    print()
Output
 epoch : 0	step : 60000/60000	loss : 634711.188	
 epoch : 1	step : 60000/60000	loss : 635014.250	
 epoch : 2	step : 60000/60000	loss : 634935.625	
 epoch : 3	step : 60000/60000	loss : 635042.812	
 epoch : 4	step : 60000/60000	loss : 634130.562	
 epoch : 5	step : 60000/60000	loss : 634780.438	
 epoch : 6	step : 60000/60000	loss : 634427.250	
 epoch : 7	step : 60000/60000	loss : 634962.000	
 epoch : 8	step : 60000/60000	loss : 634118.750	
 epoch : 9	step : 60000/60000	loss : 636151.125

I suspect some gradients are getting detached somewhere, but I am not sure what is causing this behaviour. It would be great if someone could point out my mistake in the model :)

I can’t spot any obvious error or a line of code which might detach the tensors from the computation graph.
Just to make sure you are indeed not detaching anything, you could print the gradients of all parameters after the backward pass:

for name, param in encoder.named_parameters():
    print(name, param.grad)
# same for decoder

If you see some None outputs, this would mean that these parameters won’t get gradients and thus won’t be updated.
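
A complementary check is to snapshot a weight and confirm it changes after optimizer.step(). A sketch, reusing x_batch, loss_fn, and the optimizer from your training loop, plus the layer names from your encoder:

before = encoder.e_conv_layer_1.weight.detach().clone()

# One training step, exactly as in the loop above
loc, logvar = encoder(x_batch)
z = loc + torch.exp(logvar * 0.5) * torch.randn_like(loc)
loss = loss_fn(loc, logvar, decoder(z), x_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = encoder.e_conv_layer_1.weight.detach()
print(torch.equal(before, after))    # True means the optimizer never updated this weight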

Hello @ptrblck,

First of all, thanks for your reply! I checked whether any gradients were None, infinite, or NaN. Turns out that’s not the case, yet the model still doesn’t train. But when I create a module with the encoder and decoder as submodules, it magically (for me) starts training as expected and produces really nice samples!! But I don’t get it. Shouldn’t the model train even if I have the encoder and decoder as separate modules, not wrapped in an nn.Module subclass? I may be wrong there, as I have just started using torch.

My modified code:

Modified model
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        self.encoder = nn.Sequential(OrderedDict([
        ('e_conv_layer_1', nn.Conv2d(1, 16, 5, 1)),                # 16 x 24 x 24
        ('e_relu_layer_1', nn.LeakyReLU(inplace=True)),
        ('e_batch_norm_1', nn.BatchNorm2d(16)),
        ('e_conv_layer_2', nn.Conv2d(16, 32, 5, 1)),               # 32 x 20 x 20
        ('e_relu_layer_2', nn.LeakyReLU(inplace=True)),
        ('e_batch_norm_2', nn.BatchNorm2d(32)),
        ('e_conv_layer_3', nn.Conv2d(32, 32, 11, 1)),              # 32 x 10 x 10
        ('e_relu_layer_3', nn.LeakyReLU(inplace=True)),
        ('e_batch_norm_3', nn.BatchNorm2d(32)),
        ('e_conv_layer_4', nn.Conv2d(32, 64, 5, 1)),               # 64 x 6 x 6
        ('e_relu_layer_4', nn.LeakyReLU(inplace=True)),
        ('e_batch_norm_4', nn.BatchNorm2d(64)),
        ('e_dropout_layer_1', nn.Dropout2d(p=0.75)),
        ('e_conv_layer_5', nn.Conv2d(64, 128, 5, 1)),              # 128 x 2 x 2
        ('e_relu_layer_5', nn.LeakyReLU(inplace=True)),
        ('e_batch_norm_5', nn.BatchNorm2d(128)),
        ('e_dropout_layer_2', nn.Dropout2d(p=0.85)),
        ('e_flatten_layer', nn.Flatten()),
        ('e_out_layer', LocLogvar(128*2*2, latent_dims))
        ]))

        self.decoder = nn.Sequential(OrderedDict([
        ('inv_linear_layer_1', nn.Linear(latent_dims, 128*2*2)),   # 128 * 2 * 2
        ('inv_relu_layer_5', nn.LeakyReLU(inplace=True)),
        ('inv_flatten_layer', Reshape(128, 2, 2)),                 # 128 x 2 x 2
        ('inv_conv_layer_5', nn.ConvTranspose2d(128, 64, 5, 1)),   # 64 x 6 x 6
        ('inv_batch_norm_5', nn.BatchNorm2d(64)),
        ('inv_relu_layer_4', nn.LeakyReLU(inplace=True)),
        ('inv_conv_layer_4', nn.ConvTranspose2d(64, 32, 5, 1)),    # 32 x 10 x 10
        ('inv_batch_norm_4', nn.BatchNorm2d(32)),
        ('inv_relu_layer_3', nn.LeakyReLU(inplace=True)),
        ('inv_conv_layer_3', nn.ConvTranspose2d(32, 32, 11, 1)),   # 32 x 20 x 20
        ('inv_batch_norm_3', nn.BatchNorm2d(32)),
        ('inv_relu_layer_2', nn.LeakyReLU(inplace=True)),
        ('inv_conv_layer_2', nn.ConvTranspose2d(32, 16, 5, 1)),    # 16 x 24 x 24
        ('inv_batch_norm_2', nn.BatchNorm2d(16)),
        ('inv_relu_layer_1', nn.LeakyReLU(inplace=True)),
        ('inv_conv_layer_1', nn.ConvTranspose2d(16, 1, 5, 1)),     # 1 x 28 x 28
        ('inv_batch_norm_1', nn.BatchNorm2d(1)),
        ('inv_out_layer', nn.Sigmoid())
        ]))

    def encode(self, x):
        return self.encoder(x)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
        return mu + eps*std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
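
And the corresponding training step, sketched with the same loss_fn and batching as before:

model = VAE().to(torch.device('cuda'))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # model is defined here, so its parameters do get updated

reconstructed, mu, logvar = model(x_batch)
loss = loss_fn(mu, logvar, reconstructed, x_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()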

You are right and it shouldn’t make a difference, as the computation graph will be the same.

It might be a typo or copy-paste issue, but your optimizer in the first approach takes the parameters from model, which is undefined. It should get the parameters of encoder and decoder so that it can update these parameters using their gradients.
Could you check this particular line of code?

If this is indeed just a typo in the post, could you try to rerun both approaches for a couple of epochs using different seeds and check how reproducible the effect is, i.e. that the second approach trains better than the first?
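
For reference, keeping encoder and decoder as separate modules also works, as long as both parameter sets are handed to a single optimizer, e.g. (a minimal sketch):

import itertools

optimizer = torch.optim.Adam(
    itertools.chain(encoder.parameters(), decoder.parameters()), lr=1e-3
)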

Oh!!! That was a silly, silly mistake! How did I not notice that!! Sorry for taking your time. Thanks for the help! Cheers!
