NaN Output From Loss Function

After a few passes through my network, the loss explodes exponentially until it reaches inf, and it is NaN from then on. I know I’m not the first to have this problem, so here is what I’ve already tried:

  1. My input doesn’t contain any NaNs; I replaced them with the average of the df column.
  2. I have tried L1Loss and MSELoss, and both have this problem.
  3. My learning rate is 1E-5.
  4. I have used anomaly detection, and sure enough it returns:

Function ‘AddmmBackward’ returned nan values in its 2th output.

This only tells me about a problem I already knew I had; I still don’t know how to fix it.
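Since anomaly detection only reports where a NaN surfaces in the backward pass, one way to narrow things down is to check each batch for non-finite values before it reaches the model. The following is a minimal sketch, not code from the original post: `assert_finite` is a hypothetical helper, and the tensor shapes (95 inputs, 5 targets) are only taken from the comments in the model below.

```python
import torch

def assert_finite(name, tensor):
    """Raise early if a tensor contains NaN or inf (hypothetical helper)."""
    finite = torch.isfinite(tensor)
    if not finite.all():
        bad = (~finite).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite value(s)")

# Random stand-ins for one batch; shapes are assumed from the post's comments.
sample = torch.randn(4, 95)
target = torch.randn(4, 5)
assert_finite("sample", sample)
assert_finite("target", target)

# Simulate a NaN that slipped through preprocessing.
sample[0, 3] = float("nan")
try:
    assert_finite("sample", sample)
except ValueError as err:
    print(err)
```

Calling such a check on `sample` and `target` at the top of the inner training loop would show whether the NaNs come from the data rather than from the optimization itself.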

So here is some code from the model as it is right now:

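from torch import autograd, nn

class Crypto_Net(nn.Module):
	def __init__(self, n_input, n_hidden, n_output):
		super().__init__()
		# Define the architecture (attribute name assumed; it was lost from the post)
		self.architecture = nn.Sequential(
			# Layers 1 and 2 with 95 inputs, 20 outputs
			nn.Linear(n_input, n_hidden),
			# Activation function (ReLU assumed; the original was lost from the post)
			nn.ReLU(),
			# Layers 2 and 3 with 20 inputs, 5 outputs
			nn.Linear(n_hidden, n_output)
		)

	def forward(self, volumes_sample):
		x = self.architecture(volumes_sample)
		return x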

def train(p, model, criterion, optimizer, train_loader):

	with autograd.detect_anomaly():
		# epoch is one full pass through dataset
		for epoch in range(p['epochs']):
			for i, (sample, target) in enumerate(train_loader):
				# Forward pass for each batch of volumes stored in train_loader
				model_output = model(sample)
				loss = criterion(model_output, target)
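				# Backpropagation of error: clear gradients left over from the previous step
				optimizer.zero_grad()
				# computes new grad
				loss.backward()
				# update weights
				optimizer.step()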
	#			log.debug('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
	#				.format(
	#					epoch+1, p['epochs'], 
	#					i+1, len(train_loader), loss.item()
	#			))

				#import pdb; pdb.set_trace()

Thanks in advance for any help!
Also here’s the full code on GitHub:

If you are using the CPU, could you check your numpy version? We recently had an issue where an older numpy version apparently produced NaN outputs at one point.

I have version 1.18.2

Also yes I am using the CPU

Thanks for the information!
Could you post an executable code snippet using random inputs so that we can reproduce and debug this issue, please?

EDIT: Also, which PyTorch version are you using?


Also I can try to rearrange the code to get a snippet that recreates the problem, but will you download the dataset or should I make a smaller fake version as well?

Edit: I see that you said random inputs – I will try that

Sorry to waste your time; I found the problem. A lesson for anyone else who reads this thread: write a test for the function that removes the NaNs from your input, because they’re probably still there!
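A test along those lines might look like the sketch below. It assumes the data lives in a pandas DataFrame and that NaNs are replaced with the column mean, as described in the first post; `fill_nans_with_column_mean` is a hypothetical name, not the function from the original code.

```python
import numpy as np
import pandas as pd

def fill_nans_with_column_mean(df):
    """Return a copy of df with NaNs replaced by each column's mean (hypothetical helper)."""
    return df.fillna(df.mean())

def test_fill_nans_with_column_mean():
    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
    cleaned = fill_nans_with_column_mean(df)
    # No NaN may survive the cleaning step.
    assert not cleaned.isna().any().any()
    # The filled values are the column means of the non-NaN entries.
    assert cleaned.loc[1, "a"] == 2.0
    assert cleaned.loc[0, "b"] == 3.0

test_fill_nans_with_column_mean()
```

The thread does not say what the actual bug was. Two things a test like this can catch are forgetting to use the return value (`fillna` returns a new DataFrame by default and leaves the original untouched) and a column that is entirely NaN, whose mean is itself NaN.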