Custom Dataset (Cannot load final batch of different size, cuDNN error all of a sudden)

Hi there,

My custom dataset code (multilabel classifier):

import os

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class MultiLabelDataset(Dataset):

    def __init__(self, csv_path, image_path, transform=None):
        super().__init__()

        self.data       = pd.read_csv(csv_path)
        # every column except 'ID' is a (multi-hot) label column
        self.labels     = np.asarray(self.data.drop(['ID'], axis=1))
        # the first column ('ID') holds the image file names
        self.image_file = self.data.iloc[:, 0]
        self.transforms = transform
        self.image_path = image_path
        
    def __getitem__(self, index):

        img_path = os.path.join(self.image_path,self.image_file[index])
        img = Image.open(img_path)
            
        if self.transforms is not None:
            img = self.transforms(img)
        
        labels = torch.from_numpy(self.labels[index]).float()
        
        return (img, labels)

    def __len__(self):
        return len(self.data)

Then the dataloader:

from torch.utils.data import DataLoader
from torchvision import transforms

transformation = transforms.Compose([
                                     transforms.Resize([224, 224]), transforms.ToTensor(),
                                     #transforms.Normalize(mean, std)
                                    ])
csv_path   = './main.csv'
image_path = './datasets'
dataset = MultiLabelDataset(csv_path, image_path, transform=transformation)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)

Very simple setup; it works and trains well. The big issue is that if I set drop_last=False, the whole thing suddenly breaks (lots of issues). This error comes up (on the FINAL batch, which has a different size because of drop_last=False):

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x89390e0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
output: TensorDescriptor 0x8e63ad0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 16, 128, 28, 28, 
    strideA = 100352, 784, 28, 1, 
weight: FilterDescriptor 0x82c75e0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 128, 3, 3, 
Pointer addresses: 
    input: 0x7f73e2000000
    output: 0x7f73e2620000
    weight: 0x7f73f9ee5000
Forward algorithm: 7

So I put these lines of code into a new notebook and ran them:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

and suddenly I can run training! However, when switching to validation, the same error pops up again on the FINAL batch. I have no clue what is happening here! It took me some time to debug and figure out what was causing it (the FINAL batch); setting drop_last=True so all batches have the same size finally got it to work. What is happening here?

I assume the final batch size is smaller than 16 samples? If so, could you check the shape, since the reproduction code snippet claims the data tensor would still have a batch size of 16?

I checked the shape; the last batch has size 3 (the others are 16). It should still work, right? (A differently sized final batch is expected in a lot of projects.)
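For reference, this is roughly how I checked it (just a quick sketch that iterates over the loader and prints every batch shape):

for i, (img, labels) in enumerate(train_loader):  # loader created with drop_last=False
    print(i, img.shape, labels.shape)             # the last batch has only 3 samples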

Yes, it should work, but unfortunately that doesn’t guarantee cuDNN won’t hit an internal error.
Are you able to reproduce the issue by running the code snippet from the error message with this batch size? If so, which GPU are you using?
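I.e. something along these lines (the snippet from the error message, with only the batch dimension changed from 16 to 3):

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
# same conv setup as in the error message, but with the smaller final batch size
data = torch.randn([3, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()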

The code snippet does not raise any error.

GPU: RTX 2080, GTX 1080 (tried on both)

I used ImageFolder before (the dataset provided by PyTorch) and I have never faced this issue, so I'm not sure what is causing it now.

Thanks for the update. Would it be possible to provide the model definition, so that we could try to reproduce this issue?
I don’t think it’s related to the dataset; it might be an internal cudnn error.

import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet18(pretrained=True)

# set to True to freeze the backbone and only train the new fc layer
feature_extraction = False
if feature_extraction:
    for param in model.parameters():
        param.requires_grad = False

# replace the final layer with one output per label (multilabel head)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(labels))
model.to(device)
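In case it helps with reproducing, the training step is roughly the following. This is only a minimal sketch: the BCEWithLogitsLoss/Adam choices are placeholders shown for illustration (the dataset returns float multi-hot label vectors), not necessarily relevant to the error.

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for img, target in train_loader:
    img, target = img.to(device), target.to(device)
    optimizer.zero_grad()
    out = model(img)               # [batch_size, num_labels]
    loss = criterion(out, target)  # multilabel loss on float targets
    loss.backward()                # the cuDNN error appears somewhere in this step on the final batch
    optimizer.step()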

Thanks for the code!
I cannot reproduce the cudnn issue on a 2080 using the 1.8.0+CUDA10.2+cudnn7.6.5 and 1.8.0+CUDA11.1+cudnn8.0.5 conda binaries for all shapes in [1, 32]:

for bs in torch.arange(1, 33):
    x = torch.randn(bs, 3, 224, 224, device=device)
    out = model(x)
    print(out.device)
    print(out.shape)
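Since the repro snippet in your log also contains a backward call, you could extend the check to exercise the backward pass as well, e.g. (a sketch, reusing model and device from above):

for bs in range(1, 33):
    x = torch.randn(bs, 3, 224, 224, device=device)
    out = model(x)
    out.backward(torch.randn_like(out))  # same pattern as in the error message
    torch.cuda.synchronize()
    print(bs, out.shape)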

I can run your script without any error; I'm not sure what was causing the error from my own Dataset/DataLoader previously.

*Update: after updating PyTorch, I cannot reproduce the error.

Funnily enough, the next day I restarted the Jupyter notebook kernel, set drop_last=False, and the same issue came back.

If anyone is having the same problem, please share.