Loss becomes nan on cuda:0 but not cuda:1

I have the weirdest issue. For whatever reason, my loss becomes NaN during training when I use GPU 0 (cuda:0), but training runs fine when I use GPU 1 (cuda:1). To demonstrate this, I used a simple ResNet implementation from https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/:

from torchvision.models.resnet import ResNet, BasicBlock
from torchvision.datasets import MNIST
from tqdm.autonotebook import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import inspect
import time
from torch import nn, optim
import torch
from torchvision.transforms import Compose, ToTensor, Normalize, Resize
from torch.utils.data import DataLoader

class MnistResNet(ResNet):
    def __init__(self):
        super(MnistResNet, self).__init__(BasicBlock, [2, 2, 2, 2], num_classes=10)
        # swap the first conv layer so the network accepts 1-channel MNIST images
        self.conv1 = torch.nn.Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

    def forward(self, x):
        # note: this returns softmax probabilities (as in the linked tutorial);
        # nn.CrossEntropyLoss below then applies log-softmax on top of them
        return torch.softmax(super(MnistResNet, self).forward(x), dim=-1)

def get_data_loaders(train_batch_size, val_batch_size):
    # load the raw training images once, only to compute the normalization stats
    mnist = MNIST(download=True, train=True, root=".").data.float()

    data_transform = Compose([
        Resize((224, 224)),
        ToTensor(),
        Normalize((mnist.mean() / 255,), (mnist.std() / 255,)),
    ])

    train_loader = DataLoader(MNIST(download=True, root=".", transform=data_transform, train=True),
                              batch_size=train_batch_size, shuffle=True)

    val_loader = DataLoader(MNIST(download=False, root=".", transform=data_transform, train=False),
                            batch_size=val_batch_size, shuffle=False)
    return train_loader, val_loader

def calculate_metric(metric_fn, true_y, pred_y):
    if "average" in inspect.getfullargspec(metric_fn).args:
        return metric_fn(true_y, pred_y, average="macro")
    else:
        return metric_fn(true_y, pred_y)
    
def print_scores(p, r, f1, a, n_batches):
    # average each metric over the number of validation batches
    for name, scores in zip(("precision", "recall", "F1", "accuracy"), (p, r, f1, a)):
        print(f"\t{name.rjust(14, ' ')}: {sum(scores)/n_batches:.4f}")

if __name__ == "__main__":
    # 'cuda:0' produces NaN losses; switching to 'cuda:1' trains fine
    device = 'cuda:0'  # torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
    epochs = 5

    model = MnistResNet().to(device)
    train_loader, val_loader = get_data_loaders(256, 256)

    losses = []
    loss_function = nn.CrossEntropyLoss()
    optimizer = optim.Adadelta(model.parameters())

    batches = len(train_loader)
    val_batches = len(val_loader)

    start_ts = time.time()
    # training loop + eval loop
    for epoch in range(epochs):
        total_loss = 0
        progress = tqdm(enumerate(train_loader), desc="Loss: ", total=batches)
        model.train()
        
        for i, data in progress:
            X, y = data[0].to(device), data[1].to(device)
            
            model.zero_grad()
            outputs = model(X)
            loss = loss_function(outputs, y)

            loss.backward()
            optimizer.step()
            current_loss = loss.item()
            total_loss += current_loss
            progress.set_description("Loss: {:.4f}".format(total_loss/(i+1)))
            
        # torch.cuda.empty_cache()
        
        val_losses = 0
        precision, recall, f1, accuracy = [], [], [], []
        
        model.eval()
        with torch.no_grad():
            for i, data in enumerate(val_loader):
                X, y = data[0].to(device), data[1].to(device)
                outputs = model(X)
                val_losses += loss_function(outputs, y).item()  # accumulate as a plain float

                predicted_classes = torch.max(outputs, 1)[1]
                
                for acc, metric in zip((precision, recall, f1, accuracy), 
                                    (precision_score, recall_score, f1_score, accuracy_score)):
                    acc.append(
                        calculate_metric(metric, y.cpu(), predicted_classes.cpu())
                    )

        print(f"Epoch {epoch+1}/{epochs}, training loss: {total_loss/batches:.4f}, validation loss: {val_losses/val_batches:.4f}")
        print_scores(precision, recall, f1, accuracy, val_batches)
        losses.append(total_loss/batches)
    print(losses)
    print(f"Training time: {time.time()-start_ts}s")

Running this code on cuda:0, my loss hits NaN; on cuda:1, it trains fine.

I am running with PyTorch 1.5.0, NVIDIA driver 440.33.01, and CUDA 10.2.
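For reference, the versions can be confirmed from within Python as well:

import torch

print(torch.__version__)   # 1.5.0
print(torch.version.cuda)  # 10.2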

Which devices are you using, and is this issue reproducible?
Could you add torch.autograd.set_detect_anomaly(True) at the beginning of your script and post the error message for the NaN run?
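Enabling it once at the start of the script (before any backward() call) is enough; roughly:

import torch

# Enable anomaly detection before training: backward() will then raise an
# error as soon as any backward function produces NaN values, naming the
# offending function in the traceback.
torch.autograd.set_detect_anomaly(True)

Note that this slows training down considerably, so only use it for debugging.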

In terms of devices, I have two GeForce RTX 2080 Ti cards.

In terms of reproducibility: the loss always ends up as NaN, but it does not happen at the same epoch every time. Sometimes it is epoch 1, other times epoch 9. I've even tried setting random seeds, but it does not help.
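The seeding I added was along these lines (the usual PyTorch/NumPy/cuDNN knobs; the exact calls may have differed slightly):

import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds all CUDA devices
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False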

Output from torch.autograd.set_detect_anomaly(True):

Traceback (most recent call last):
  File "train.py", line 71, in <module>
    loss.backward()
  File "/home/anaconda3/envs/torch_train_env/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/anaconda3/envs/torch_train_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Regardless of the seed, you always get NaN on one device, while the other device always works?
If that's the case, could you run some stress tests on both devices and, if possible, a memory check?
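As a side note on the traceback: 'LogSoftmaxBackward' comes from the log-softmax that nn.CrossEntropyLoss applies internally, so the NaN is most likely already present in the model's output by the time the loss is computed. A contrived sketch of how a single non-finite logit surfaces exactly there:

import torch
from torch import nn

# An inf logit makes log_softmax produce NaN (inf - inf), which then
# propagates into the gradients during backward.
logits = torch.tensor([[float("inf"), 0.0]], requires_grad=True)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))
loss.backward()
print(loss)         # tensor(nan)
print(logits.grad)  # contains nan

With set_detect_anomaly(True), this is the point where autograd raises the error you are seeing.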

Yes, that's correct.

What would be the best way to do this?

Thank you for helping me look into this. I hope we can get it resolved.

You could search for stress tests for the device; I cannot recommend a particular one.
Before doing that, you could also swap the devices and check whether the connection might be faulty.
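As a rough starting point, a quick PyTorch-only consistency check (not a proper stress test; the matrix size and iteration count here are arbitrary) could compare both devices against a float64 CPU reference:

import torch

torch.manual_seed(0)
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)
ref = a.double() @ b.double()  # float64 CPU reference result

for dev in ("cuda:0", "cuda:1"):
    x, y = a.to(dev), b.to(dev)
    max_err = 0.0
    for _ in range(100):  # repeat to catch intermittent faults
        out = (x @ y).double().cpu()
        max_err = max(max_err, (out - ref).abs().max().item())
    print(f"{dev}: max abs error over 100 runs = {max_err:.3e}")

If one device reports much larger or non-finite errors than the other, that would point towards faulty hardware.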

Hello, I'm seeing exactly the same problem. With the same code base, one GPU (GTX 1050) runs without any issues, while the other one (GTX 1660) produces NaN values.
I stress-tested the second GPU, but everything seems to be fine with it.
In your case, did you track down the cause of the problem? Was it a faulty GPU?
Any help would be appreciated!
Thanks a lot!