I'm getting the same error, but my situation seems to be different from everyone else's. Here's a minimal script that reproduces it:
import torch
import torch.nn as nn
import numpy as np


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.base_n_filter = 8
        self.conv1 = nn.Conv3d(1, self.base_n_filter, 3, stride=1, padding=1)
        self.conv2 = nn.Conv3d(self.base_n_filter, self.base_n_filter*2, 3, stride=1, padding=1)
        self.conv3 = nn.Conv3d(self.base_n_filter*2, 1, 3, stride=1, padding=1)

    def forward(self, img):
        output1 = self.conv1(img)
        output2 = self.conv2(output1)
        output3 = self.conv3(output2)
        return output3


if __name__ == '__main__':
    img = torch.zeros([1,1,248,248,140]).cuda()
    label = torch.rand_like(img, device=img.device)
    model = Net().cuda()
    output = model(img)
    loss_fn = nn.MSELoss()
    loss = loss_fn(output, label)
    loss.backward()
The error only occurs for specific input sizes. For example, if I change the input shape to [1,1,250,250,150], the error no longer occurs, so it doesn't seem like an overclocking issue. The error also only occurs when self.base_n_filter is 8 or greater. I'm not using any Python lists to store my convolution layers, and I have moved the model to CUDA.
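To check whether raw tensor size could explain the difference between the two shapes, here is a rough back-of-the-envelope sketch in plain Python (no GPU needed). It assumes float32 activations and that every convolution preserves the spatial size (kernel 3, stride 1, padding 1, as in the model above). The helper name `activation_elements` is made up for illustration, and it counts only the input and the three forward outputs, not gradients or any cuDNN workspace.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: default float32 tensors


def activation_elements(shape, base_n_filter=8):
    """Count elements in the input plus the three conv outputs of Net.

    Assumes each conv keeps the spatial dimensions unchanged, which holds
    for kernel_size=3, stride=1, padding=1.
    """
    batch, in_channels, *spatial = shape
    voxels = batch * prod(spatial)
    # Channel counts: input, conv1 output, conv2 output, conv3 output
    channels = [in_channels, base_n_filter, base_n_filter * 2, 1]
    return voxels * sum(channels)


failing = activation_elements([1, 1, 248, 248, 140])
working = activation_elements([1, 1, 250, 250, 150])

for name, n in [("failing shape", failing), ("working shape", working)]:
    mib = n * BYTES_PER_FLOAT32 / 2**20
    print(f"{name}: {n} elements, about {mib:.1f} MiB of forward activations")
```

Under these assumptions the failing shape holds fewer elements than the working one, so the total amount of data alone does not seem to explain the error.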
I am also using an up-to-date GPU and CUDA. The error occurs on both a Tesla K80 and a GTX 1080 Ti, with PyTorch 1.2, cudatoolkit=10.0, CUDA/10.0.130, and cudnn/7.6.2.24-CUDA-10.0.130. It also only occurs on Linux machines.
Any help would be truly appreciated!