Sending data to device in PyTorch causes it to contain NaN values (ROCm)

dunedin87 · April 17, 2023, 1:17am

Hello,

A bit of background:
PyTorch version 2.0, ROCm version 5.4.2, on Linux Mint 20.3 (based on Ubuntu 20.04). Single AMD 6950 XT.

Checking for a CUDA device produces the following warning:
~...local/lib/python3.8/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

torch.cuda.is_available() comes out as True

print(torch.cuda.get_device_name(device=device)) shows AMD Radeon Graphics
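For anyone reproducing this, here is a minimal sketch of how the backend of the installed build could be confirmed. On ROCm builds the `torch.cuda` API is reused for AMD GPUs, which is why `torch.cuda.is_available()` can return True on this machine. The helper name `describe_backend` is hypothetical, and the sketch assumes `torch.version.hip` is set on ROCm builds and is None otherwise. NVML is NVIDIA's management library, so the warning above is probably unrelated to the nan problem, but that is an assumption rather than something verified here.

```python
import torch


def describe_backend():
    """Return a dict summarizing which GPU backend this PyTorch build targets.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    info = {
        "torch_version": str(torch.__version__),
        # Expected to be a version string on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        # Expected to be a version string on CUDA builds and None otherwise.
        "cuda_version": getattr(torch.version, "cuda", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        # On ROCm this reports the AMD device through the torch.cuda API.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_backend().items():
        print(f"{key}: {value}")
```

If `hip_version` prints as None on a machine with an AMD card, the installed wheel is likely not a ROCm build.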

On to the problem: my loss and model outputs were showing nan. Puzzled, I went back and checked the input data for nan, and that check also came back True. This was odd, as I had previously run the same code on Kaggle notebooks without any problem. I then tested with random data, and it was still reported as containing nan. Finally, I checked the data before and after sending it .to(device): as expected, the initial data did not contain nan values, but after .to(device) it did. The code I used for testing follows.

import timm  # used by the ENet model defined further down
import torch
import torch.nn as nn

# ENet is the model class defined later in this post.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ENet(3, 2)
model.to(device)
criterion = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for i in range(10):
    # Random inputs and random 0/1 labels, created on the CPU.
    data = torch.rand(3, 3, 512, 512).float()
    print('Initial Data nan', torch.any(torch.isnan(data)))
    labels = torch.randint(low=0, high=2, size=(3, 2)).float()

    # Move to the device and check for nan again.
    data = data.to(device)
    print('Device Data nan', torch.any(torch.isnan(data)))
    labels = labels.to(device)

    # One training step.
    output = model(data)
    loss = criterion(output, labels)
    loss.backward()
    opt.step()
    opt.zero_grad()

The output of the first iteration is:

Initial Data nan tensor(False)

Device Data nan tensor(True, device='cuda:0')

This tells me that the randomly generated data doesn’t contain nan, but after .to(device) it does.
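The check above can be narrowed further with a round trip, sketched below under stated assumptions. It copies a random tensor to the device, copies it back, and runs the nan check in both places. If the copy that comes back to the CPU is nan-free and identical to the original, the device-side `isnan` result may be the misleading part rather than the transfer itself; if the copy comes back changed, the transfer looks suspect. The function name `roundtrip_report` and its dictionary keys are hypothetical, chosen only for this illustration.

```python
import torch


def roundtrip_report(shape=(3, 3, 512, 512), device=None):
    """Copy random data CPU -> device -> CPU and report nan checks at each stage.

    Hypothetical diagnostic helper; not part of PyTorch.
    """
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    cpu_data = torch.rand(shape)      # values in [0, 1), so never nan
    dev_data = cpu_data.to(device)    # the transfer under suspicion
    if device.type == "cuda":
        torch.cuda.synchronize()      # make sure the copy has finished
    back = dev_data.cpu()             # bring the same bytes back to the host

    return {
        "device": str(device),
        "nan_on_cpu": bool(torch.isnan(cpu_data).any()),
        "nan_on_device": bool(torch.isnan(dev_data).any()),
        "nan_after_copy_back": bool(torch.isnan(back).any()),
        "roundtrip_exact": bool(torch.equal(cpu_data, back)),
    }


if __name__ == "__main__":
    for key, value in roundtrip_report().items():
        print(f"{key}: {value}")
```

Running it with a much smaller `shape` as well, for example `(4,)`, would show whether the result depends on tensor size.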

The problem does not depend on the model or the data (it occurs even with randomly generated data), but here is a basic model for anyone who wants to replicate and troubleshoot. It uses the timm package for EfficientNet. I have also tried a ResNet model (coded locally, not from timm) and got the same error.

class ENet(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(ENet, self).__init__()
        
        self.first_conv = nn.Conv2d(in_channels = in_channels, out_channels = 3, kernel_size = 1)
        self.backbone = timm.create_model('efficientnet_b0', pretrained = False)
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(1000,num_classes)
        self.softmax = nn.Softmax(dim = -1)
        
    def forward(self, x):
        out = self.first_conv(x)
        out = self.relu(out)
        out = self.backbone(out)
        out = self.relu(out)
        out = self.classifier(out)
        out = self.softmax(out)
        return out

I have seen some other threads about data turning to nan when sent to the device, but they were all on CUDA devices (NVIDIA), and none of the solutions reached there seem to work here. Also, as mentioned before, the above code works when I run it on a Kaggle notebook, which uses a P100. Does that make it likely that the problem is with ROCm?
What is weird is that the problem isn’t even in the model weights, outputs, or losses: simply transferring data to the device makes it contain nans.

Other topics with similar problems for reference. Any help would be appreciated, thanks.