Getting NaNs after Conv2d

Hi all, I get NaNs and Infs after the first convolution of my unet for a diffusion model. All parameters are randomly initialized at first, and the checkpoint I use to generate images has been trained for 30k iterations. Below is part of my unet code.

While initializing my unet:

self.init_conv = nn.Conv2d(channel, init_channel, 3, padding = 1)

Define the forward pass:

def forward(self, x, t, c):
    if torch.sum(torch.isnan(x)) > 0:  # check if input x has NaNs
        print(f'Input x has NaNs at t: {t}')
        raise Exception()
    if torch.sum(torch.isinf(x)) > 0:  # check if input x has Infs
        print(f'Input x has Infs at t: {t}')
        raise Exception()

    h = self.init_conv(x)

    for params in self.init_conv.parameters():  # check the weights of the convolution
        if (torch.sum(torch.isnan(params.data)) + torch.sum(torch.isinf(params.data))) > 0:
            print(f"params have nan or inf")
            raise Exception()

    if torch.sum(torch.isnan(h)) > 0:  # check if output h has NaNs
        print(f'After init_conv h has NaNs at t: {t}')
        raise Exception()

    if torch.sum(torch.isinf(h)) > 0:  # check if output h has Infs
        print(f'After init_conv h has Infs at t: {t}')
        raise Exception()
    ...
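As an aside, the repeated checks above can be collapsed into one helper built on torch.isfinite, which is False for both NaN and Inf. A minimal sketch (the name assert_finite is mine, not part of the original code):

def assert_finite(tensor, name, t):
    # torch.isfinite is False for NaN and for +/-Inf, so one call covers both checks
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f'{name} has NaN/Inf at t: {t}')

# usage inside forward:
# assert_finite(x, 'Input x', t)
# h = self.init_conv(x)
# assert_finite(h, 'After init_conv h', t)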

I added two if blocks to check whether the input x has NaNs or Infs, and another two to do the same for the convolution output h. A for loop checks whether the weights of the convolution contain NaNs or Infs. Below is the result.

The result shows that the input x is fine, and so are the weights of the convolution layer, but after the convolution the output has Infs. Is it possible to get Infs even when the input and the weights contain no NaNs or Infs?
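One way this can happen is float16 overflow: the fp16 range tops out around 65504, so a convolution that sums many finite products can still produce Inf even though the input and the weights are all finite. A minimal sketch (the values are made up for illustration, and a CUDA device is assumed):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, 3, padding=1).half().cuda()
with torch.no_grad():
    conv.weight.fill_(1.0)   # finite weights
    conv.bias.zero_()
x = torch.full((1, 3, 8, 8), 3000.0, dtype=torch.float16, device='cuda')  # finite input

y = conv(x)
# an interior output position sums 3*3*3 = 27 products of 3000*1.0 = 81000 > 65504 -> Inf
print(torch.isinf(y).any())   # tensor(True, device='cuda:0')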

Check your loss. Make sure your loss is computed only with PyTorch tensors and NOT NumPy functions. The tensors maintain a graph of their operations, and it's possible to 'lose' the graph if you don't do it correctly. Then when loss.backward() is called, no gradients or bad gradients are computed, and things go wrong very quickly.
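To illustrate that point, here is a minimal sketch (tensor names are made up) showing how a NumPy detour drops the graph while a torch-only loss keeps it:

import torch
import torch.nn.functional as F

pred = torch.randn(4, requires_grad=True)
target = torch.randn(4)

# Wrong: routing through NumPy forces a detach (a tensor that requires grad
# cannot call .numpy() directly), so the resulting loss carries no graph
bad_loss = torch.tensor(((pred.detach().numpy() - target.numpy()) ** 2).mean())
print(bad_loss.grad_fn)   # None -> backward() would not reach the parameters

# Right: stay in torch; the loss keeps its grad_fn and backward() populates gradients
loss = F.mse_loss(pred, target)
print(loss.grad_fn)       # <MseLossBackward0 ...>
loss.backward()
print(pred.grad)          # gradients are populated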

Hi, thanks for your reply. I didn't use any NumPy functions in my model, but I used the einops library to help with the tensor computations.
Below is part of my diffusion model class, which computes the loss during training:

class diffusion(nn.Module):
    def __init__(self, **configs):
        super().__init__()  # needed before registering submodules
        self.loss_fn = nn.MSELoss() if configs['loss_fn'] == 'L2' else nn.L1Loss()  # I use MSELoss during training
        ...
    def loss_simple(self, x, context):
        ...
        x_t, target_noise = self.q_sample(x, t)
        predict_noise = self.unet_model(x_t, t, context)
        return self.loss_fn(target_noise, predict_noise)
    ...
    def forward(self, x, context):
        return self.loss_simple(x, context)
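If the goal is to find exactly which operation first produces a non-finite gradient, PyTorch's anomaly detection can help. A debugging-only sketch, assuming configs, x, and context are already set up as in the snippets above:

# anomaly detection makes backward() raise with a traceback pointing at the
# forward op that produced the NaN/Inf gradient (slow, use only while debugging)
model = diffusion(**configs)
with torch.autograd.detect_anomaly():
    loss = model(x, context)
    loss.backward()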

Part of my training script. I used the accelerate library to help with training.

self.accelerator = Accelerator(mixed_precision='fp16')
self.diffusion = diffusion(**configs)
total_loss = 0.
...
for _ in range(gradient_accumulation):
    # ... receive (imgs, contexts) from the dataloader
    with self.accelerator.autocast():
        loss = self.diffusion(imgs, contexts)
        loss = loss / gradient_accumulation
        total_loss += loss.item()
    pbar.set_description(f'Step:{s} Accumulated MSELoss:{total_loss:.4f}')
    self.accelerator.backward(loss)

self.accelerator.clip_grad_norm_(self.diffusion.parameters(), 1.0)
self.accelerator.wait_for_everyone()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
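One detail worth noting here: with mixed_precision='fp16', activations inside autocast run in float16 and can overflow even when everything is finite in float32 (see the overflow sketch above). A small guard, reusing the names from the snippet above, makes the first bad step visible instead of silently adding inf to total_loss:

for _ in range(gradient_accumulation):
    # ... receive (imgs, contexts) from the dataloader
    with self.accelerator.autocast():
        loss = self.diffusion(imgs, contexts)
    if not torch.isfinite(loss):
        # log it (so it shows up when grepping the output file) and skip this step
        print(f'Step:{s} non-finite loss: {loss.item()}')
        continue
    loss = loss / gradient_accumulation
    total_loss += loss.item()
    self.accelerator.backward(loss)

A quick way to rule fp16 out entirely is to run a few steps with mixed_precision='no' (or 'bf16' on hardware that supports it) and see whether the Infs disappear.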

Checking the output file:

Then I searched for "nan" and "inf" in this output file, but no pattern matched.
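If grepping turns up nothing, another option is to make the model itself report the first module whose output goes non-finite, so the message lands in the output file. A minimal sketch using forward hooks (the helper name is my own):

import torch

def add_nonfinite_hooks(model):
    # print a line the moment any module produces a NaN/Inf output;
    # keep this only while debugging, since hooks add overhead on every forward pass
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f'non-finite output from {module.__class__.__name__}')
    for m in model.modules():
        m.register_forward_hook(hook)

# e.g. add_nonfinite_hooks(self.diffusion) once, before the training loop starts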