CUDA errors for inputs above size 255x255x4

I am seeing a strange issue related to input size. My network runs fine with a 3D input of size 255x255x8, but as soon as the first two dimensions reach 256 it fails with an error (cuDNN error: CUDNN_STATUS_EXECUTION_FAILED), even after halving the third dimension to 4. At first I thought this was a memory issue, but 255x255x8 is a much larger input than 256x256x4, with roughly twice as many elements. In fact, I can increase the size to 255x255x10 and it still runs.
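To make the size comparison concrete, here is a quick sketch that counts the elements in each input volume. It assumes batch size 1, a single channel, and float32 storage (4 bytes per element); the helper name `num_elements` is only for illustration.

```python
def num_elements(side, depth, channels=1, batch=1):
    """Number of scalar values in a batch x channels x side x side x depth volume."""
    return batch * channels * side * side * depth


for side, depth in [(255, 8), (255, 10), (256, 4)]:
    n = num_elements(side, depth)
    # float32 takes 4 bytes per element
    print(f"{side}x{side}x{depth}: {n} elements, {n * 4 / 2**20:.2f} MiB")

# 255x255x8: 520200 elements, 1.98 MiB
# 255x255x10: 650250 elements, 2.48 MiB
# 256x256x4: 262144 elements, 1.00 MiB
```

The failing shape is the smallest of the three, which is why a plain out-of-memory explanation looks unlikely.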

Is anyone aware of any specific reason for this oddly specific threshold?

What are the input shapes of your network? Could you give more details about your cuDNN error?

I reduced the network to the minimum needed to reproduce the error. The code below works fine on CPU with the same input shapes, but fails on GPU if two dimensions are >= 256 and the third is >= 4. Even 256x256x100 works on CPU without any issues. I should probably file a bug report on GitHub.

import torch
import torch.nn as nn
torch.set_default_tensor_type('torch.cuda.FloatTensor')

class Net(nn.Module):
    
    def __init__(self):
        super(Net,self).__init__()
        self.layer1=nn.Conv3d(1,64,5,padding=2)  #nn.BatchNorm3d(1)
        
        self.layer2 = nn.Conv3d(64,1,1) #binary mask classifier
        self.sigmoid=nn.Sigmoid()
            
            
    def forward(self,in_vol):
        
        first = self.layer1(in_vol)

        MonoClass=self.layer2(first)
        
        Mask=self.sigmoid(MonoClass)
        
        return Mask #torch.cat([Mask, Classes],dim=1) #return Mask, Classes


A=Net().cuda()
batch=1

SideSize=256
Zsize=4
X=torch.randn(batch,1,SideSize,SideSize,Zsize)
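FakeMask=torch.randn(batch,1,SideSize,SideSize,Zsize).cuda()
# Binarize to a 0/1 mask: the two steps together threshold at 0.5 (the mask is not used in the loss below)
FakeMask[FakeMask>0.5]=1
FakeMask[FakeMask<0.6]=0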
X=X.cuda()

optimizer=torch.optim.Adam(A.parameters())

optimizer.zero_grad()
Mask = A(X)

loss=torch.sum(Mask)
loss.backward()
optimizer.step()
print(loss)
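
One way to narrow this down, as a rough sketch rather than a definitive diagnosis: run only the first convolution with cuDNN switched on and off via `torch.backends.cudnn.enabled`, and compare a working shape against the failing one. The helper `try_conv3d` is a hypothetical name introduced here, and the script falls back to CPU when no GPU is present. A failed CUDA call may leave the process in a bad state, so the cuDNN-free runs go first; if the results look inconsistent, it is probably safer to run each configuration in a fresh process.

```python
import torch
import torch.nn as nn


def try_conv3d(side, depth, use_cudnn, device):
    """Run one forward/backward pass; return None on success, else the error text."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = use_cudnn
    try:
        conv = nn.Conv3d(1, 64, 5, padding=2).to(device)
        x = torch.randn(1, 1, side, side, depth, device=device)
        conv(x).sum().backward()
        if device.type == "cuda":
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return None
    except RuntimeError as err:
        return str(err)
    finally:
        # restore the global cuDNN setting so later code is unaffected
        torch.backends.cudnn.enabled = previous


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for use_cudnn in (False, True):  # cuDNN-free runs first
        for side, depth in [(255, 8), (256, 4)]:
            error = try_conv3d(side, depth, use_cudnn, device)
            print(f"cudnn={use_cudnn} {side}x{side}x{depth}:", error or "ok")
```

If 256x256x4 only fails with cuDNN enabled, that would point at the cuDNN convolution path rather than at memory, which is useful detail for a bug report.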