How to implement validation loss in a VAE training session

I am trying to calculate the validation loss alongside the training loss while training a simple VAE network, but I am receiving the following CUDA error:

<ipython-input-27-5638e9c724e6> in unSupTrain(epoch)
     22             recon_batch, mu, sigma = model(data)
     23             # Get valloss value
---> 24             vallossitem = elbo(recon_batch, data, mu, sigma).item()
     25             # Append loss to history
     26             hist_validation += vallossitem

RuntimeError: CUDA error: device-side assert triggered

It probably occurs because of the loss function setup, but I couldn't find a way to separate the training and validation losses.

Here is the elbo function:

def elbo(recon_x, x, mu, sigma):
    '''Loss function.'''
    # Reshape the input
    x = x.view(-1, INP_SIZE)

    # Binary cross entropy
    RE = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence
    KL = F.kl_div(recon_x, x, reduction='sum', log_target=True)
    # Return the loss
    return RE - KL
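
One thing I am wondering about: as far as I know, F.binary_cross_entropy expects both its input and target to be probabilities in [0, 1], and on the GPU an out-of-range value can surface as exactly this kind of device-side assert. A range check along these lines (just a sketch, not something I have verified) might narrow it down:

# Hypothetical sanity check (sketch): binary_cross_entropy expects
# probabilities, so both tensors have to stay inside [0, 1].
assert recon_x.min() >= 0 and recon_x.max() <= 1, 'recon_x outside [0, 1]'
assert x.min() >= 0 and x.max() <= 1, 'x outside [0, 1]'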

Here is the training function (the relevant parts of the validation and training loss calculation):

# ---------- Validation -----------
    # Don't track gradients
    with torch.no_grad():
        # Trace
        print('Running Validation')
        for batch_idx, (data, _) in enumerate(valloader):
            # Convert to cuda if possible
            data = data.to(device=DEVICE)

            # Forward
            recon_batch, mu, sigma = model(data)
            # Get valloss value
            vallossitem = elbo(recon_batch, data, mu, sigma).item()
            # Append loss to history
            hist_validation += vallossitem
            # Per-sample loss, rounded for display
            vallossitem = round(vallossitem / len(data), 3)

# -------- Training -------- #
    for batch_idx, (data, _) in enumerate(trainloader):
        # Convert to cuda if possible
        data = data.to(device=DEVICE)

        # Zero grad
        optimizer.zero_grad()
        # Forward
        recon_batch, mu, sigma = model(data)
        # Loss calculation
        trainloss = elbo(recon_batch, data, mu, sigma)
        # Backward
        trainloss.backward()
        # Get trainloss value
        trainlossitem = trainloss.item()
        # Append loss to history
        hist_training += trainlossitem
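
The plan is to average the accumulated sums at the end of each epoch (a sketch of my bookkeeping; since elbo uses reduction='sum', dividing by the dataset sizes gives an average loss per sample):

# Epoch-end bookkeeping (sketch, variable names as above):
avg_train = hist_training / len(trainloader.dataset)
avg_val = hist_validation / len(valloader.dataset)
print(f'Epoch {epoch}: train loss {avg_train:.3f} | val loss {avg_val:.3f}')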

I would appreciate any suggestions. Thanks in advance!


Could you rerun the code via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the complete stack trace here?
Alternatively, you could also run the script on the CPU, which might yield a better error message.
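
If the code lives in a notebook, both options can also be set up in-process, e.g. (a sketch; DEVICE is the name from your snippet):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before the first CUDA call

import torch
# Alternatively, debug on the CPU, which usually raises a readable Python error:
DEVICE = torch.device('cpu')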


I had doubts about this elbo function. Usually, in the examples, the loss function is implemented in the following way:

def loss_function(recon_x, x, mu, logvar):
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")

    # see Appendix B from VAE paper:
    #  Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
    # https://arxiv.org/abs/1312.6114
    # 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return BCE + KLD, BCE, KLD
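
As a quick check (my own addition, not from the example), this analytic KLD is zero exactly when the posterior equals the prior, i.e. mu = 0 and logvar = 0:

import torch

# Sanity check (sketch): the closed-form KLD vanishes when q(z|x) = p(z) = N(0, I).
mu = torch.zeros(4, 20)
logvar = torch.zeros(4, 20)
print(-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()))  # prints a zero tensor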

I wanted to reproduce this using the torch.nn.functional.kl_div function, which is similar to what OP implemented in elbo:

def loss_function(recon_x, x, mu, logvar):
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")

    # see Appendix B from VAE paper:
    # Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
    # https://arxiv.org/abs/1312.6114
    # 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    
    KLD = torch.nn.functional.kl_div(
        recon_x, 
        x.view(-1, 784), 
        reduction='batchmean', 
        log_target=False
    )

    return BCE + KLD, BCE, KLD

But I wonder: is this correct? I ask because the KLD term should measure the divergence between the approximate posterior (the latent distribution) and the prior (a standard normal distribution). One can see that this version does not work: the reconstructions are sharp, but the generative aspect of the VAE is completely lost compared to using

KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

I think that kl_div should instead be receiving the mu that comes from the latent distribution and, as the target, the eps from the reparameterize function?

def reparameterize(self, mu, logvar):
    
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)

    return mu + eps * std, eps
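
In the meantime, the latent KLD can also be computed with torch.distributions instead of a hand-written formula, which makes the posterior-vs-prior reading explicit (a sketch, not from the original code):

import torch
from torch.distributions import Normal, kl_divergence

# Sketch: KL( q(z|x) = N(mu, std^2)  ||  p(z) = N(0, I) ), summed over all dims.
mu = torch.randn(4, 20)
logvar = torch.randn(4, 20)
std = torch.exp(0.5 * logvar)

posterior = Normal(mu, std)
prior = Normal(torch.zeros_like(mu), torch.ones_like(std))

kld = kl_divergence(posterior, prior).sum()
# Agrees with the closed-form expression from the paper:
closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
print(torch.allclose(kld, closed_form))  # True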

PS: I probably should have opened a new post for this, but in my mind it made sense here.

Should the output be passed to the loss_function, i.e. loss_function(output)? Did I get that right? The output I receive is a tuple of three tensors with shapes [1, 3, 32, 32] and 2x [1, 512].

I couldn't pass the output directly; I think that's because it is a tuple.
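
Or does the tuple need to be unpacked first? Something like this is what I have in mind (a sketch; I am guessing the order from the shapes):

# Guess (sketch): the model seems to return (recon_x, mu, logvar),
# so the tuple would be unpacked before calling the loss function.
recon_x, mu, logvar = output  # shapes [1, 3, 32, 32], [1, 512], [1, 512]
loss, bce, kld = loss_function(recon_x, data, mu, logvar)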