CVAE drops to NaN in the loss function after 38 epochs

Good morning,

I hope everyone is staying safe and isolating as much as possible/required…

I have a problem with a 1D CVAE I am creating. No matter what I do, after 38 or 39 epochs I always get a NaN value in the loss function when using BCEWithLogitsLoss.

I have a feeling that it has to do with the loss function calculation, or that I am doing something wrong in setting up the model, but I really cannot figure it out.

Loss Function

def sample(self, eps=None):
    if eps is None:
        eps = torch.randn(1, self.lat_dim)
        print("eps=", eps)
    return self.decode(eps, apply_sigmoid=True)

def loss_fn(model, data):
    mean, logvar = model.encode(data)
    z2 = model.reparm(mean, logvar)
    out = model.decode(z2)

    criterion = torch.nn.BCELoss(reduction='none')   # element-wise loss, summed per sample below
    #criterion = torch.nn.BCEWithLogitsLoss(reduction='none')   # alternative if decode() returned raw logits
    BCE = criterion(out, data)
    logpx_z = -torch.sum(BCE, [1, 2], keepdim=False)   # per-sample reconstruction term
    #logpx_z = -torch.sum(BCE, 2, keepdim=True)
    logpz = log_normal_pdf(z2, torch.tensor(0.), torch.tensor(0.))
    logqz_x = log_normal_pdf(z2, mean, logvar)
    mean = logpx_z + logpz - logqz_x   # note: overwrites the encoder mean with the per-sample ELBO

    loss = -torch.mean(mean)
    return logvar, mean, loss, out, logqz_x, logpz, logpx_z, z2
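
For reference, a numerically safer variant of the same ELBO can be written with BCEWithLogitsLoss on raw decoder outputs, along the lines of the commented-out criterion above. This is only a sketch, and it assumes model.decode() returns unbounded logits (no sigmoid) and reuses the existing log_normal_pdf helper:

def loss_fn_logits(model, data):
    # sketch only: assumes model.decode() returns raw logits, not probabilities
    mean, logvar = model.encode(data)
    z2 = model.reparm(mean, logvar)
    logits = model.decode(z2)

    criterion = torch.nn.BCEWithLogitsLoss(reduction='none')
    BCE = criterion(logits, data)        # element-wise, numerically stable
    logpx_z = -torch.sum(BCE, [1, 2])    # per-sample reconstruction log-likelihood
    logpz = log_normal_pdf(z2, torch.tensor(0.), torch.tensor(0.))
    logqz_x = log_normal_pdf(z2, mean, logvar)
    elbo = logpx_z + logpz - logqz_x
    return -torch.mean(elbo)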

Model Activation

        data = data.cuda()
        optimizer.zero_grad()
        logvar,mean,loss,out,logqz_x,logpz,logpx_z,z2 = loss_fn(model, data)
        loss.backward()
        optimizer.step()

As a quick aside, the loss starts to drop towards zero but never goes below 300, and the printed value always shows a grad_fn=

I have wondered if the problem is in loss.backward() & optimizer.step()…
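
One way to narrow this down is PyTorch's anomaly detection together with an explicit finiteness check on the loss, to see whether the NaN first appears in the forward pass or during backward(). A rough sketch (train_loader is just a placeholder name for the DataLoader):

torch.autograd.set_detect_anomaly(True)   # debugging only, this slows training down

for data in train_loader:                 # placeholder name
    data = data.cuda()
    optimizer.zero_grad()
    logvar, mean, loss, out, logqz_x, logpz, logpx_z, z2 = loss_fn(model, data)
    if not torch.isfinite(loss):
        print("non-finite loss in the forward pass:", loss.item())
        break
    loss.backward()                       # raises at the op that produced a NaN gradient
    optimizer.step()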

I hope someone can spot the error…

Many thanks & stay safe everyone

chaslie

Since this issue seems to be reproducible, could you store the model output and target which create the NaN value, and check their values?
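
For example, a check along these lines right after the loss computation would be enough to capture the failing batch (the file name is only an example):

if not torch.isfinite(out).all() or not torch.isfinite(loss):
    torch.save({'output': out.detach().cpu(),
                'target': data.detach().cpu()}, 'nan_batch.pt')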

Hi ptrblck,

I hope you are well.

I have looked at the output from the last three epochs before the failure, and the only thing that looks strange is that in the epoch before the NaN appears, a single value in the mean tensor goes to NaN, though there should be no reason for this.

chaslie

PS: this is related to my other post about the CUDA runtime, which I am currently looking into.

Just a quick sanity check: make sure your input features are normalized.
You can use this transformation on your features if required:

features = (features - features.min())/(features.max() - features.min())

It will normalize your features, making their minimum and maximum values 0 and 1 respectively.
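
If all feature values in a batch can be identical, a small epsilon (an addition here, not part of the original suggestion) avoids a 0/0 division, which would itself create NaNs:

eps = 1e-8
features = (features - features.min()) / (features.max() - features.min() + eps)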

Hi Braindotai,

Thanks for this. I have normalised the inputs using:

def normalize(x):
    # scale each sample by its own per-row maximum
    x_normed = x / x.max(1, keepdim=True)[0]
    return x_normed
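
As a purely illustrative check, it would be worth confirming that the normalized tensor is finite and lies inside [0, 1], since BCELoss can produce NaN/inf when its targets fall outside that range or when the per-row max is zero or negative:

x_normed = normalize(x)
print(x_normed.min().item(), x_normed.max().item())
assert torch.isfinite(x_normed).all()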

chaslie

Is the mean value calculated using these lines of code?

    logpx_z = -torch.sum(BCE, [1, 2], keepdim=False)
    logpz = log_normal_pdf(z2, torch.tensor(0.), torch.tensor(0.))
    logqz_x = log_normal_pdf(z2, mean, logvar)
    mean = logpx_z + logpz - logqz_x

If so, could you try to store the activations and upload them so that we could try to reproduce this issue?
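
For example, a dump of the intermediate terms when the loss becomes non-finite would be enough (the keys and file name are only illustrative):

logvar, mean, loss, out, logqz_x, logpz, logpx_z, z2 = loss_fn(model, data)
if not torch.isfinite(loss):
    torch.save({'logpx_z': logpx_z.detach().cpu(),
                'logpz': logpz.detach().cpu(),
                'logqz_x': logqz_x.detach().cpu(),
                'z2': z2.detach().cpu()}, 'nan_activations.pt')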