NaN Output From Loss Function

After a few passes through my network, the loss grows exponentially until it reaches inf, and then it is NaN for the rest of training. I know I’m not the first to have this problem, so here is what I’ve already tried:

  1. My input doesn’t contain any NaNs; I replaced them with the average of the df column
  2. I have tried L1Loss and MSELoss, and both have this problem
  3. My learning rate is 1E-5
  4. I have used anomaly detection, and sure enough it returns

Function ‘AddmmBackward’ returned nan values in its 2th output.

which just tells me about a problem I already knew I had. I still don’t know how to fix it.

So here is some code from the model as it is right now:

import torch.nn as nn
from torch import autograd

class Crypto_Net(nn.Module):
	def __init__(self, n_input, n_hidden, n_output):
		super().__init__()
		# Define the architecture
		self.net = nn.Sequential(
			# Layer 1 -> 2: n_input (95) inputs, n_hidden (20) outputs
			nn.Linear(n_input, n_hidden),
			# Activation function
			nn.LeakyReLU(),
			# Layer 2 -> 3: n_hidden (20) inputs, n_output (5) outputs
			nn.Linear(n_hidden, n_output)
			)

	def forward(self, volumes_sample):
		x = self.net(volumes_sample)
		return x

def train(p, model, criterion, optimizer, train_loader):

	# Anomaly detection reports the backward function that produced NaN
	with autograd.detect_anomaly():
		# epoch is one full pass through dataset
		for epoch in range(p['epochs']):
			for i, (sample, target) in enumerate(train_loader):
				# Forward pass for each batch of volumes stored in train_loader
				model_output = model(sample)
				loss = criterion(model_output, target)
				# Clear the gradients left over from the previous step
				optimizer.zero_grad()
				# Backpropagation: compute new gradients
				# (retain_graph=True is not needed; the graph is rebuilt every batch)
				loss.backward()
				# update weights
				optimizer.step()
				# log.debug('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
				# 	.format(
				# 		epoch+1, p['epochs'],
				# 		i+1, len(train_loader), loss.item()
				# ))

				#import pdb; pdb.set_trace()


Thanks in advance for any help!
Also, here’s the full code on GitHub: https://github.com/lollyi/crypto_net

If you are using the CPU, could you check your numpy version, as we recently had an issue with an older numpy, which apparently created NaN outputs at one point?

I have version 1.18.2

Also, yes, I am using the CPU.

Thanks for the information!
Could you post an executable code snippet using random inputs so that we can reproduce and debug this issue, please?

EDIT: Also, which PyTorch version are you using?

1.4.0

Also I can try to rearrange the code to get a snippet that recreates the problem, but will you download the dataset or should I make a smaller fake version as well?

Edit: I see that you said random inputs – I will try that
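
For anyone following along, a random-input reproduction of this kind might look like the sketch below. The layer sizes (95, 20, 5), MSELoss, and the 1e-5 learning rate come from the post above; the SGD optimizer, the batch size of 64, and the 100 steps are assumptions made only for illustration. If the loss stays finite on random data, the model and training loop are probably fine and the real input data becomes the main suspect.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes taken from the comments in the posted model: 95 -> 20 -> 5.
model = nn.Sequential(nn.Linear(95, 20), nn.LeakyReLU(), nn.Linear(20, 5))
criterion = nn.MSELoss()
# Assumption: plain SGD; the thread does not say which optimizer was used.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

# Random stand-in data (assumption: one batch of 64 rows, reused every step).
sample = torch.randn(64, 95)
target = torch.randn(64, 5)

losses = []
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(sample), target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# True when no step produced inf or NaN.
all_finite = all(math.isfinite(value) for value in losses)
print("all losses finite:", all_finite)
```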


Sorry to waste your time; I found the problem. Lesson for anyone else who reads this thread: write a test for the function that gets the NaNs out of your input, because they’re probably still there!
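
A minimal sketch of what such a test could look like, assuming the input lives in a pandas DataFrame and NaNs are replaced with the column mean as described earlier in the thread. The function names `fill_missing_with_column_mean` and `assert_all_finite` are hypothetical and are not taken from the linked repository. One way mean-filling can silently fail is a column with no valid values at all: its mean is itself NaN, so the fill leaves it untouched. The sketch falls back to 0.0 in that case, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def fill_missing_with_column_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaner: replace inf and NaN with the column mean."""
    # Treat +/-inf as missing so they do not distort the column mean.
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    # mean() skips NaN by default, but an all-NaN column still has a NaN mean.
    cleaned = cleaned.fillna(cleaned.mean())
    # Assumed fallback: use 0.0 for columns that had no valid values at all.
    return cleaned.fillna(0.0)


def assert_all_finite(df: pd.DataFrame) -> None:
    """Raise ValueError if any NaN or inf is still present."""
    finite = np.isfinite(df.to_numpy(dtype=float))
    if not finite.all():
        bad_columns = df.columns[~finite.all(axis=0)].tolist()
        raise ValueError(f"non-finite values remain in columns: {bad_columns}")


# Toy data covering a partly missing column, a fully missing one, and an inf.
raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [np.inf, 2.0, 4.0],
})
cleaned = fill_missing_with_column_mean(raw)
assert_all_finite(cleaned)
```

Calling `assert_all_finite` on the cleaned frame right before it is converted to tensors turns a silent NaN into an immediate error that names the offending columns.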