(Updated) NVIDIA RTX A6000 INCOMPATIBLE WITH PYTORCH

Hello!
Several days ago I posted an issue about PyTorch with the NVIDIA RTX A6000 GPU; here is the original link:
Nvidia rtx a6000 gpu incompatible with pytorch - windows - PyTorch Forums

Many thanks to ptrblck. I uninstalled CUDA on our workstation and installed the CUDA 11.3 toolkit and PyTorch binaries using the following command:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
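
For reference, the installed build can be double-checked with a short snippet like this (just a generic sanity check, not specific to our setup):

import torch

# Print the PyTorch build and the CUDA / cuDNN versions it was compiled against.
print(torch.__version__)
print(torch.version.cuda)               # should report 11.3 for the cudatoolkit=11.3 install
print(torch.backends.cudnn.version())   # cuDNN version shipped with the binaries

# Confirm the A6000 is visible and report its compute capability (8.6 for GA102).
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
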
But before the first epoch finished, the same error occurred.

Could you please offer any other suggestions about this issue? Thanks a lot.
PS: My classmates and I found the behavior described by ptrblck rather puzzling. We uninstalled CUDA 10.2 and installed CUDA 11.1 using the binary installer from the official NVIDIA site; we then uninstalled CUDA 11.1 on Windows 10 and built CUDA and PyTorch from source, but that did not solve the problem. When we ran the same program on CUDA 11.1 and PyTorch 1.8.1, it got through three epochs before the error popped up: RuntimeError: CUDA error: an illegal memory access was encountered.
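
Since the illegal memory access is reported asynchronously, the Python line in the traceback is not necessarily where the faulty kernel ran; rerunning with blocking launches (a standard CUDA debugging step, nothing specific to our script) pins it down more precisely:

import os
# Must be set before CUDA is initialized, i.e. before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# ... then run the training script as usual; kernel launches are now synchronous,
# so the RuntimeError is raised at the Python line that actually triggered it.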

Since you are seeing the cuDNN issue apparently in both setups, could you post an executable code snippet which would reproduce the illegal memory access on the A6000, please?

Of course we can. The underlined part of the following picture denotes the position where the error occurred:

and the network structure is defined as follows:

[screenshot of the network definition]

Could you post it as a code snippet by wrapping it into three backticks ```?
Also, please post the shapes for all input tensors which are needed to execute the code and reproduce the issue.

OK, the network and the parameter-update process are defined as follows:

import torch
import torch.nn as nn
import numpy as np

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        #self.conv1 = nn.Conv2d(3, 6, 5)
        #self.fc1 = nn.Linear(72*12, 768)
        #self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(2*3*64*64, 80*80)
        self.fc2 = nn.Linear(80*80, 125*125)
        #self.fc3 = nn.Linear(2*3*16*16, 60*60)
        self.us = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.conv1 = nn.Conv2d(1, 64, 5, padding=2) # padding=2 keeps the 5x5 convs from shrinking the spatial size
        self.conv2 = nn.Conv2d(64, 32, 5, padding=2)
        self.conv3 = nn.Conv2d(32, 16, 5, padding=2)
        self.conv4 = nn.Conv2d(16, 4, 5, padding=2)
        self.conv5 = nn.Conv2d(4, 1, 5, padding=2)


    # With relu and without relu gives a similar result
    def forward(self, x):
        x = x.view(-1, 2*3*64*64)
        #x = self.conv1(x)
        #x = x.view(-1, 72*12)
        ## Fully connected layers
        x = self.fc1(x)
        #x = F.relu(self.fc1(x))
        x = nn.LeakyReLU(0.1)(x)
        x = self.fc2(x)
        # x = F.relu(self.fc2(x))
        x = nn.LeakyReLU(0.1)(x)
        x = torch.reshape(x, (np.shape(x)[0], -1, 125, 125))
        ## Upsampling
        x = self.us(x)
        ## Five convolution layers
        x = nn.LeakyReLU(0.1)( self.conv1(x) )
        x = nn.LeakyReLU(0.1)( self.conv2(x) )
        x = nn.LeakyReLU(0.1)( self.conv3(x) )
        x = nn.LeakyReLU(0.1)( self.conv4(x) )
        x = self.conv5(x)
        x = x.view(np.shape(x)[0], -1)
        # x = self.conv1(x)
        return x
#Generate model
net = Net().double()
#Batch size too small, no parallelization, only 1 GPU
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("Use GPU")
else:
    device = torch.device("cpu")
    print("Use CPU")



#Use all 3 GPUs
# if torch.cuda.device_count() > 1:
#     print("Let's use", torch.cuda.device_count(), "GPUs!")
#     net = nn.DataParallel(net)
#Put data on all GPUs

net.to(device)
#Learning Process
# Train the network
for epoch in range(nepoch):
    running_loss = 0.0
    current_loss.append(10)
    print('Epoch: %d\n' %(epoch))
    for i, data in enumerate(trainloader, 0):
        #print(i)
        inputs, labels = data[0].to(device), data[1].to(device)
        # Zero gradient
        optimizer.zero_grad()
        # Forward, backward(gradient), optimize (add gradient to weight)
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        # Print loss information
        running_loss += loss.item()
        current_loss[epoch] += loss.item()
        print(loss.item())
        if i % 4 == 3: # Print every 4 mini-batches
            #print('[%d, %3d] loss: %.3f' % (epoch+1, i+1, running_loss/9))
            scheduler.step(running_loss)
            loss_history.append(running_loss)
            running_loss = 0.0
    current_loss[epoch] = current_loss[epoch] - 10
    if epoch >=1:
        if current_loss[epoch] < min(current_loss[0:epoch]):
            PATH = '%s/Corrosion-conv-fc-%d.pth' % (fd, niter+1)
            torch.save(net.state_dict(), PATH)
    else:
        PATH = '%s/Corrosion-conv-fc-%d.pth' % (fd, niter+1)
        torch.save(net.state_dict(), PATH)
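
A side note on the checkpoint logic at the end of the loop: the current_loss.append(10) and subtract-10 bookkeeping can be replaced by tracking the best epoch loss in a single variable; best_loss below is a hypothetical helper, and the scheduler/printing lines are left out of this sketch:

best_loss = float('inf')   # replaces the current_loss list and the +10/-10 offsets
for epoch in range(nepoch):
    epoch_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss < best_loss:   # save only when this epoch improved on the best so far
        best_loss = epoch_loss
        torch.save(net.state_dict(), '%s/Corrosion-conv-fc-%d.pth' % (fd, niter + 1))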

Shape of input tensors:
inputs: torch.Size([64,2,3,64,64])
labels: torch.Size([64,62500])
When it runs outputs = net(inputs), the error occurs.
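
For completeness, the loop above also relies on objects that are defined elsewhere in our script and were not included in the snippet (criterion, optimizer, scheduler, trainloader, nepoch, current_loss, loss_history, fd, niter). A rough stand-in that matches the shapes above (all values hypothetical) would be:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins, only so the posted loop can run end to end.
nepoch = 10
fd, niter = '.', 0                        # checkpoint folder and run index
current_loss, loss_history = [], []

# Random data matching the shapes above: inputs [N, 2, 3, 64, 64], labels [N, 62500].
x = torch.randn(256, 2, 3, 64, 64).double()
y = torch.randn(256, 62500).double()
trainloader = DataLoader(TensorDataset(x, y), batch_size=64)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
# scheduler.step() is called with running_loss, so ReduceLROnPlateau matches that usage.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)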


Thanks for the code!
I’ve let the model train for a few hours using PyTorch 1.9.1+cu113 as well as 1.8.2+cu111 on an A6000 using this code:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import numpy as np


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        #self.conv1 = nn.Conv2d(3, 6, 5)
        #self.fc1 = nn.Linear(72*12, 768)
        #self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(2*3*64*64, 80*80)
        self.fc2 = nn.Linear(80*80, 125*125)
        #self.fc3 = nn.Linear(2*3*16*16, 60*60)
        self.us = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.conv1 = nn.Conv2d(1, 64, 5, padding=2) # padding=2 keeps the 5x5 convs from shrinking the spatial size
        self.conv2 = nn.Conv2d(64, 32, 5, padding=2)
        self.conv3 = nn.Conv2d(32, 16, 5, padding=2)
        self.conv4 = nn.Conv2d(16, 4, 5, padding=2)
        self.conv5 = nn.Conv2d(4, 1, 5, padding=2)


    # With relu and without relu gives a similar result
    def forward(self, x):
        x = x.view(-1, 2*3*64*64)
        #x = self.conv1(x)
        #x = x.view(-1, 72*12)
        ## Fully connected layers
        x = self.fc1(x)
        #x = F.relu(self.fc1(x))
        x = nn.LeakyReLU(0.1)(x)
        x = self.fc2(x)
        # x = F.relu(self.fc2(x))
        x = nn.LeakyReLU(0.1)(x)
        x = torch.reshape(x, (np.shape(x)[0], -1, 125, 125))
        ## Upsampling
        x = self.us(x)
        ## Five convolution layers
        x = nn.LeakyReLU(0.1)( self.conv1(x) )
        x = nn.LeakyReLU(0.1)( self.conv2(x) )
        x = nn.LeakyReLU(0.1)( self.conv3(x) )
        x = nn.LeakyReLU(0.1)( self.conv4(x) )
        x = self.conv5(x)
        x = x.view(np.shape(x)[0], -1)
        # x = self.conv1(x)
        return x
#Generate model
net = Net().double()
#Batch size too small, no parallelization, only 1 GPU
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("Use GPU")
else:
    device = torch.device("cpu")
    print("Use CPU")


x = torch.randn(64*100, 2, 3, 64, 64).double()
y = torch.randn(64*100, 62500).double() 

dataset = TensorDataset(x, y)
trainloader = DataLoader(dataset, batch_size=64)
criterion = nn.MSELoss()

#Use all 3 GPUs
# if torch.cuda.device_count() > 1:
#     print("Let's use", torch.cuda.device_count(), "GPUs!")
#     net = nn.DataParallel(net)
#Put data on all GPUs

net.to(device)
#Learning Process
# Train the network
nepoch = 4
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
for epoch in range(nepoch):
    print('Epoch: %d\n' %(epoch))
    for i, data in enumerate(trainloader, 0):
        #print(i)
        inputs, labels = data[0].to(device), data[1].to(device)
        # Zero gradient
        optimizer.zero_grad()
        # Forward, backward(gradient), optimize (add gradient to weight)
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        # Print loss information
        print(loss.item())

and could reproduce any issue.

Thanks, but I am a little confused. You said "and could reproduce any issue"; did you mean you ran the code completely and no problem was found?

Ah sorry… missing word. Yes, I executed the posted code and was not able to run into any issue. The code ran for ~5 hours without any problems on the A6000.

Could you please leave an e-mail so that I can send the complete code to you?

Could you post the code in a GitHub Gist, please?
Also, are you able to run into the error using my posted code snippet?

Yes, I ran into the error with your code snippet. Do you know how to build PyTorch with CUDA from source on Windows 10? I have seen the guide on the PyTorch site, and building from source on Ubuntu seems easy, but building on Windows 10 looks complicated.

Since our workstation runs Windows 10 and we use Spyder to run the program, is it possible that the error occurred because PyTorch is trying to use memory allocated to the Windows OS?
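
To at least see how much device memory PyTorch itself is holding, a small helper like this (hypothetical name) can be called inside the training loop and compared against what nvidia-smi or Task Manager shows:

import torch

def report_gpu_memory(tag=""):
    # Memory held by live tensors vs. memory reserved by PyTorch's caching allocator.
    alloc = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print("%s allocated: %.1f MiB, reserved: %.1f MiB" % (tag, alloc, reserved))

# e.g. call report_gpu_memory('epoch %d' % epoch) once per epoch in the training loop.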

I don’t know enough about Windows and its specifics to be able to speculate about the root cause of the issue. Could you try to create the cuDNN API logs as described here and send them to me?

OK, we are producing the cuDNN log according to the instructions. By the way, can this A6000 GPU be used with Ubuntu 16.04 LTS? It seems installing the CUDA 11.x driver is possible, but the A6000 manual says the minimum OS requirement is Ubuntu 18.
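
For reference, the logging is enabled roughly like this (per the cuDNN documentation; the variables have to be set before cuDNN is initialized, and the file name is just a placeholder):

import os

# Enable cuDNN API logging before torch initializes cuDNN.
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn_api.log"   # placeholder file name

import torch
# ... then run the failing script as usual; the cuDNN API calls are written to cudnn_api.log.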

I don’t know if an older Ubuntu could cause issues. Since 16.04 LTS reached its EOL on April 30th, 2021, I would generally recommend updating to a supported OS. For my test I’ve been using 20.04 LTS.

Hello!
The cuDNN log file has been produced and sent to the moderators as a private message. The log produced by running the code snippet above for one epoch has also been uploaded to GitHub.

Thanks! Let me check it. By the way, I assume your workload is not "private" and that’s why you’ve sent it.

Hello, have you found out the cause of the problem? Thanks a lot, by the way.