MSELoss producing NaN on the second or third input each time

I am using the MSE loss to regress values, and for some reason I get NaN outputs almost immediately. The first input always comes through unscathed, but after that the loss quickly goes to infinity and the prediction comes out as a matrix of NaNs. Why might this be happening?

I’ve checked my inputs and ground truth (GT), and those values are correct and not all zeros. My training loop is below:

# MSE loss
c_label_criterion = nn.MSELoss(size_average=True).cuda()

# Instantiate optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# Run all epochs
for ep in xrange(num_epochs):

	# Shuffle the indices
	random.shuffle(indices)

	# Run though the dataset completely
	for i in indices:
		# Load batch of data
		img, c_gt = dataset.next_input() # B x 3 x M x N image, B x 2 x M x N ground truth

		# Put the data on the GPU
		img = img.cuda()

		#------------------------------------------------------------------
		# Run a forward pass
		#------------------------------------------------------------------
		# Zero out the gradients		
		optimizer.zero_grad()

		# Forward pass
		c_pred = model(img)


		#------------------------------------------------------------------
		# Backpropagate
		#------------------------------------------------------------------
		# Compute component losses
		c_loss = c_label_criterion(c_pred, c_gt)

		# Perform backpropagation
		c_loss.backward()

		# Update the weights
		optimizer.step()

I initialize my model weights (I’m training from scratch) with the following:

# Initialize conv weights with Gaussian random values
for m in model.modules():
	if isinstance(m, nn.Conv2d):
		n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
		m.weight.data.normal_(0, math.sqrt(2. / n))
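
For reference, std = sqrt(2 / n) with n = k * k * out_channels is He/Kaiming initialization in fan_out mode, so the same thing can be written with the built-in helper (a minimal sketch, assuming a PyTorch version that provides nn.init.kaiming_normal_):

# Equivalent He/Kaiming initialization via the built-in helper (fan_out mode for ReLU nets)
for m in model.modules():
	if isinstance(m, nn.Conv2d):
		nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
		if m.bias is not None:
			nn.init.constant_(m.bias, 0)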

My architecture is basically FCN but with some minor alterations. I don’t perform batch normalization anywhere in the network. My GT data is in the range [0, 1], but after the first input the predictions are on the order of hundreds. For example, after one input, the MSE loss value (with size_average=True) is 1022004.8750. I thought perhaps I was dealing with exploding gradients, but I use ReLUs throughout. Here is an example block:

	self.conv_block4 = nn.Sequential(
		nn.Conv2d(256, 256, 3, padding=1), # 256 in, 256 out
		nn.ReLU(inplace=True), # ReLU nonlinearity
		nn.Conv2d(256, 256, 3, padding=1), # 256 in, 256 out
		nn.ReLU(inplace=True), # ReLU nonlinearity
		nn.Conv2d(256, 256, 3, padding=1), # 256 in, 256 out
		nn.ReLU(inplace=True), # ReLU nonlinearity
		nn.MaxPool2d(2, stride=2, ceil_mode=True)
	)

Could anyone shed some light on what may be occurring?

If the loss is going to infinity, does making your learning rate smaller help?

Is the loss actually going to infinity? There is an infinity value inf, but nan is something different. For example, taking the log of a negative number gives you nan, while dividing 1 by 0 inside a tensor gives you inf.
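
A quick way to see the difference in PyTorch (a minimal sketch):

import torch

x = torch.tensor([-1.0, 0.0, 1.0])
print(torch.log(x))  # tensor([nan, -inf, 0.]) -- log of a negative number gives nan
print(1.0 / x)       # tensor([-1., inf, 1.]) -- dividing by zero inside a tensor gives inf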

I’m seeing the loss go to inf and the predictions all become NaN. I tried shrinking the learning rate way down, but that made no difference.

It seems I needed to normalize my RGB values to [0,1]. It looks like the gradients were just way too large otherwise.
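
The change was roughly this (a sketch of my loading step, assuming the loader returns 8-bit RGB tensors):

# Scale raw 8-bit RGB values from [0, 255] down to [0, 1] before the forward pass
img, c_gt = dataset.next_input()
img = img.float() / 255.0
img = img.cuda()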

Yes, CNNs generally train best with pixel values divided by 255, i.e. scaled to [0, 1].

Furthermore, you should probably also normalize per channel, so that each channel’s value distribution is centered around 0 with a standard deviation of 1.

Here is a script to do that for PyTorch.
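
A minimal sketch of that kind of script (train_dataset is a placeholder for your own dataset, assumed to yield image tensors already scaled to [0, 1]):

import torch
from torch.utils.data import DataLoader

# Accumulate approximate per-channel mean and std over the whole training set
loader = DataLoader(train_dataset, batch_size=64, num_workers=4)
mean = torch.zeros(3)
std = torch.zeros(3)
n_samples = 0
for imgs, _ in loader:                     # imgs: B x 3 x H x W, values in [0, 1]
	imgs = imgs.view(imgs.size(0), 3, -1)  # flatten the spatial dimensions
	mean += imgs.mean(dim=2).sum(dim=0)    # sum of per-image channel means
	std += imgs.std(dim=2).sum(dim=0)      # sum of per-image channel stds (approximation)
	n_samples += imgs.size(0)
mean /= n_samples
std /= n_samples
print(mean, std)                           # plug these into transforms.Normalize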

Note: for a network pretrained on ImageNet you should use the ImageNet mean and stddev:

transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
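
In a full input pipeline it would typically sit right after ToTensor, which already scales pixels to [0, 1], e.g. (a sketch):

from torchvision import transforms

# ToTensor scales uint8 pixels to [0, 1]; Normalize then centers each channel
preprocess = transforms.Compose([
	transforms.ToTensor(),
	transforms.Normalize(mean=[0.485, 0.456, 0.406],
	                     std=[0.229, 0.224, 0.225]),
])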

For regression I used nn.MSELoss() as the loss, but after 3 or 4 iterations the weights of the final layer went to infinity, and as a result the loss and output also became inf/NaN. When I switched to nn.L1Loss(), everything was fine. I tried different numbers of input samples and also normalized all the input data. Any thoughts?

Could you try to lower the learning rate and see if the loss decreases using nn.MSELoss?
Also, could you check the shapes of your model output and the target before passing them to the criterion?
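
The shape check matters because nn.MSELoss can broadcast a mismatched target against the prediction and silently inflate the loss; a quick guard (reusing the tensor names from the original post) could be:

# Guard against silent broadcasting between prediction and target shapes
assert c_pred.shape == c_gt.shape, \
	"shape mismatch: pred {} vs target {}".format(c_pred.shape, c_gt.shape)
c_loss = c_label_criterion(c_pred, c_gt)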

I also ran into this issue. nn.L1Loss() works fine, but nn.functional.l1_loss() produces NaNs. So weird.

Could you post a reproducible code snippet so that we could have a look?