Hi there,
Here is my custom dataset code (multi-label classifier):
import os

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class MultiLabelDataset(Dataset):
    def __init__(self, csv_path, image_path, transform=None):
        super().__init__()
        self.data = pd.read_csv(csv_path)
        # Every column except 'ID' is a binary label
        self.labels = np.asarray(self.data.drop(['ID'], axis=1))
        # The first column holds the image file name
        self.image_file = self.data.iloc[:, 0]
        self.transforms = transform
        self.image_path = image_path

    def __getitem__(self, index):
        img_path = os.path.join(self.image_path, self.image_file[index])
        img = Image.open(img_path)
        if self.transforms is not None:
            img = self.transforms(img)
        labels = torch.from_numpy(self.labels[index]).float()
        return img, labels

    def __len__(self):
        return len(self.data)
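As a quick sanity check (just a sketch, not part of the training script; assumes the CSV and image folder below exist), indexing the dataset directly returns an (image, labels) pair:

# Smoke test: labels come back as a float tensor, ready for BCE-style losses.
ds = MultiLabelDataset('./main.csv', './datasets')
img, labels = ds[0]
print(len(ds), labels.shape, labels.dtype)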
Then the dataloader:
from torch.utils.data import DataLoader
from torchvision import transforms

transformation = transforms.Compose([
    transforms.Resize([224, 224]),
    transforms.ToTensor(),
    # transforms.Normalize(mean, std)
])

csv_path = './main.csv'
image_path = './datasets'
dataset = MultiLabelDataset(csv_path, image_path, transform=transformation)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)
It's a very simple setup, and it works and trains fine. The big issue is that if I set drop_last=False, the whole thing suddenly breaks (lots of issues). This error comes out on the FINAL batch, which has a different size because of drop_last=False:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 128, 28, 28], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x89390e0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
output: TensorDescriptor 0x8e63ad0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 16, 128, 28, 28,
strideA = 100352, 784, 28, 1,
weight: FilterDescriptor 0x82c75e0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7f73e2000000
output: 0x7f73e2620000
weight: 0x7f73f9ee5000
Forward algorithm: 7
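For context, with drop_last=False the final batch has len(dataset) % batch_size samples whenever that remainder is non-zero, so it is the one batch with a different shape. A quick sketch to check it (batch size 16 as above):

# Size of the final (partial) batch when drop_last=False.
batch_size = 16
remainder = len(dataset) % batch_size  # 0 means the last batch is full-sized
print(f'{len(dataset)} samples -> final batch of {remainder or batch_size}')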
So I copied that repro snippet into a new notebook and ran it,
and suddenly I can run training! However, when I switch to validation, the same error pops up again on the FINAL batch. I had no clue what was going on! It took me some time of debugging to figure out that the final batch was the cause, and then making all batches the same size (drop_last=True) finally got it to work. What is happening here?
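In case it helps with reproducing, here is a minimal sketch of an isolation test (not my exact code; model is a placeholder for my network, already on the GPU, and the odd size mimics the final partial batch):

# Feed one deliberately odd-sized batch through the model to check whether
# the partial batch alone triggers the cuDNN error.
odd = len(dataset) % 16 or 1  # final batch size under drop_last=False
imgs = torch.stack([dataset[i][0] for i in range(odd)]).cuda()
with torch.no_grad():
    out = model(imgs)  # model: placeholder for the actual network
torch.cuda.synchronize()  # force any async cuDNN error to surface here
print(out.shape)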