Occasional NaN loss when using custom CUDA kernels

Hi,

I implemented an RNN with a custom fused CUDA kernel. When I train it with a manual seed, it sometimes produces a NaN loss from the very beginning (the first batch). Other times, running the exact same program, the losses are finite.
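
For clarity, the seeding at the top of my script is roughly the standard recipe below (the seed value 1111 is just an example, not my actual setting):

import torch

torch.manual_seed(1111)            # example seed; fixes the CPU RNG state
torch.cuda.manual_seed_all(1111)   # example seed; fixes the CUDA RNG state on all devices
# Note: this pins PyTorch's RNG, but it does not make a custom CUDA kernel
# deterministic if the kernel itself uses a nondeterministic reduction
# (e.g. atomic adds), which may explain run-to-run differences.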

This only seems to happen when I use my custom CUDA kernel. With the pure-Python version (built from standard PyTorch functions), it never occurs, even when that version runs on the GPU.
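
In case it helps, here is a minimal sketch of the kind of sanity check I'm planning to add right after the forward pass, to catch non-finite values coming out of the fused kernel before they reach the loss (the function name check_finite and the commented placement are placeholders, not my actual code):

import torch

def check_finite(name: str, tensor: torch.Tensor) -> None:
    # Fail fast if the fused kernel emitted NaN/Inf, instead of letting the
    # bad values flow into log_softmax inside CrossEntropyLoss.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf values")

# Hypothetical placement inside the training loop, right before the loss:
#   check_finite("rnn output", output.view(-1, ntokens))
#   loss = criterion(output.view(-1, ntokens), targets)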

The stack trace is shared below. Any advice would be greatly appreciated.

/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/serialization.py:292: UserWarning: Couldn't retrieve source code for container of type CrossEntropyLoss. It won't be checked for correctness upon loading.
  "type " + obj.__name__ + ". It won't be checked "
/opt/conda/conda-bld/pytorch_1573049306803/work/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
  File "train_adam.py", line 389, in <module>
    global_step = train(global_step)
  File "train_adam.py", line 329, in train
    loss = criterion(output.view(-1, ntokens), targets)
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)

Traceback (most recent call last):
  File "train_adam.py", line 389, in <module>
    global_step = train(global_step)
  File "train_adam.py", line 330, in train
    loss.backward()
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wz1232/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
terminate called without an active exception
Aborted (core dumped)
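
For reference, the "Traceback of forward call" warning at the top appears because I run training with autograd anomaly detection turned on, which is just the standard switch, roughly:

import torch

# With anomaly detection enabled, autograd records the forward stack so that
# a backward error (here, LogSoftmaxBackward returning NaN) can be traced
# back to the forward call that produced it.
torch.autograd.set_detect_anomaly(True)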