cuBLAS runtime error when using dilated Conv3d, but it works on CPU or with ordinary convs

I was trying an ASPP-like module with dilated conv3d:

import torch
import torch.nn as nn
from torch.nn import BatchNorm3d

class _AtrousSpatialPyramidPoolingModule(nn.Module):

    def __init__(self, in_dim, reduction_dim=256, output_stride=16, rates=[6, 12, 18]):
        super(_AtrousSpatialPyramidPoolingModule, self).__init__()

        self.features = []
        self.features.append(
            nn.Sequential(nn.Conv3d(in_dim, reduction_dim, kernel_size=1, bias=False),
                          BatchNorm3d(reduction_dim),
                          nn.ReLU(inplace=True)))
        
        for r in rates:
            self.features.append(nn.Sequential(
                nn.Conv3d(in_dim, reduction_dim, kernel_size=3,
                          dilation=r, padding=r, bias=False),
                BatchNorm3d(reduction_dim),
                nn.ReLU(inplace=True)
            ))
        self.features = torch.nn.ModuleList(self.features)

    def forward(self, x):
        # NOTE: `out` is initialized by an image-level pooling branch that is
        # omitted in this excerpt (the full module is posted further below in the thread)
        for f in self.features:
            y = f(x)
            out = torch.cat((out, y), 1)
        return out

It worked on CPU, but I got the following error on GPU:

RuntimeError: cublas runtime error : library not initialized at /tmp/pip-req-build-ocx5vxk7/aten/src/THC/THCGeneral.cpp:216

The error was raised at the line y = f(x), in the second loop iteration (i.e. the first dilated conv3d branch).

But when I changed the dilated convs to ordinary convs (dilation=1), it also worked on the GPU:

        for r in rates:
            self.features.append(nn.Sequential(
                nn.Conv3d(in_dim, reduction_dim, kernel_size=3,
                          dilation=1, padding=1, bias=False),
                BatchNorm3d(reduction_dim),
                nn.ReLU(inplace=True)
            ))
        self.features = torch.nn.ModuleList(self.features)

My environment:

  • 4 Tesla V100-SXM2-32GB GPUs; as far as I can tell, no OOM occurs
  • cudatoolkit 10.0.130
  • PyTorch 1.1.0 (installed via conda; also tried PyTorch 1.3.1)
  • Python 3.6.7

Thank you for your time!

Could you rerun your code with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here, please?

Also, could you post all arguments and input shapes, so that we can reproduce this issue?
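
Alternatively, you can set the environment variable at the top of the script itself, before anything touches the GPU, e.g. (a minimal sketch):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before the first CUDA call

import torch  # import torch and run the model only after the variable is set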

Thank you for your reply!
I ran it with CUDA_LAUNCH_BLOCKING=1 and got the following error:

THCudaCheck FAIL file=/tmp/pip-req-build-jh50bw28/aten/src/THCUNN/vol2col.h line=64 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train_segmentation.py", line 213, in <module>
    train(args)
  File "train_segmentation.py", line 123, in train
    model.forward(split='train')
  File "/myproject/models/feedforward_seg_model.py", line 127, in forward
    self.prediction = self.net(self.input)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/myproject/models/networks/unet_CT_V_edge_3D.py", line 67, in forward
    x = self.aspp(up1, acts)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/myproject/models/networks/unet_CT_V_edge_3D.py", line 327, in forward
    y = f(x)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/myhome/.conda/envs/pytorch_seg/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 476, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /tmp/pip-req-build-jh50bw28/aten/src/THCUNN/vol2col.h:64

But, weirdly, when I wrote a simple network containing only the ASPP module above and fed in randomly generated samples, np.random.randn(32, 144, 144, 144).astype(np.float32), it worked.
I am very confused. Could you help me with the above error message?
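
For reference, the standalone test I mean looks roughly like this (a paraphrased sketch; the exact toy script is posted further below in this thread):

import numpy as np
import torch

# build the ASPP module alone and feed it one random volume on the GPU
aspp = _AtrousSpatialPyramidPoolingModule(32, 16).cuda()
sample = np.random.randn(32, 144, 144, 144).astype(np.float32)
x = torch.from_numpy(sample).unsqueeze(0).cuda()  # add a batch dimension
out = aspp(x)  # this runs without the illegal memory access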

It looks like a bug and we would need a code snippet to reproduce this issue.
Based on your last sentence it seems that random input data works fine, while your training workflow produces this error?
Would it be possible to save the particular input batch, which creates the issue, and upload it?
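
Something as simple as this inside your training loop would do (the variable and file names are just placeholders):

# save the exact batch that triggers the error; it can be reloaded later with torch.load
torch.save(inputs.detach().cpu(), 'failing_batch.pt')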

Thank you for your reply.

The backbone in front of the ASPP module is a V-Net, and the output of its final decoder layer is fed into the ASPP module. I call this input up4, and it has shape (32, 144, 144, 144).

Strangely, the following variations also worked (roughly sketched after the list):

  • Take the feature map above up4, i.e. up3 with shape (64, 144, 144, 144), and use its first 32 channels as the ASPP input;
  • Take up4's symmetric feature map of the V-Net, down1, as ASPP input.
  • Save up4 from the main model and feed it into my simple toy model;
    but if I change the data batch (random shuffle), the problem remains.
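
In code, the variations look roughly like this (a self-contained sketch with random tensors standing in for my real feature maps; the shapes follow the description above, and I assume down1 also has 32 channels):

import torch

aspp = _AtrousSpatialPyramidPoolingModule(32, 16).cuda()

# dummy stand-ins for the V-Net feature maps (batch dimension of 1 here)
up3 = torch.randn(1, 64, 144, 144, 144, device='cuda')    # feature map above up4
down1 = torch.randn(1, 32, 144, 144, 144, device='cuda')  # encoder feature symmetric to up4

out_a = aspp(up3[:, :32])  # first 32 channels of up3: works
out_b = aspp(down1)        # symmetric encoder feature: works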

I cannot make sense of the error message, so to be honest, after those tests I don't know whether it's a bug or my mistake :frowning:
I am going to keep working on it and see whether I can reproduce it in a minimal code snippet.
Again, thank you, ptrblck!

If you are not using a custom CUDA extension, which might create this illegal memory access, then it’s a bug and we need to look into it and fix it.
Thanks for the debugging so far, and please let us know if we can assist somehow.

Hi, ptrblck. I’ve found the reason!

Please try the following code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
import numpy as np

torch.backends.cudnn.benchmark = False
# torch.backends.cudnn.deterministic = False
torch.backends.cudnn.deterministic = True # crash

class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.aspp = _AtrousSpatialPyramidPoolingModule(32, 16, output_stride=16)
        # Add 3 conv layers
        self.conv_layers = nn.Sequential(
            nn.Conv3d(16*5,
                      1,
                      kernel_size=3,
                      stride=1,
                      padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        x = self.aspp(x)
        x = self.conv_layers(x)

        return x

class _AtrousSpatialPyramidPoolingModule(nn.Module):

    def __init__(self, in_dim, reduction_dim=256, output_stride=16, rates=[6, 12, 18]):
        super(_AtrousSpatialPyramidPoolingModule, self).__init__()

        if output_stride == 8:
            rates = [2 * r for r in rates]
        elif output_stride == 16:
            pass
        else:
            raise ValueError('output stride of {} not supported'.format(output_stride))

        self.features = []
        # 1x1 no dilation
        self.features.append(
            nn.Sequential(nn.Conv3d(in_dim, reduction_dim, kernel_size=1, bias=False),
                          nn.BatchNorm3d(reduction_dim),
                          nn.ReLU(inplace=True)))
        # dilation
        for r in rates:
            self.features.append(nn.Sequential(
                nn.Conv3d(in_dim, reduction_dim, kernel_size=3,
                          dilation=r, padding=r, bias=False),
                nn.BatchNorm3d(reduction_dim),
                nn.ReLU(inplace=True)
            ))
        self.features = torch.nn.ModuleList(self.features)

        # img level features
        self.img_avg_pooling = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(in_dim, reduction_dim, kernel_size=1, bias=False),
            nn.ReLU(inplace=True))


    def forward(self, x):
        x_size = x.size()

        img_features = self.img_avg_pooling(x)
        img_features = F.interpolate(img_features, x_size[2:],
                                     mode='trilinear',align_corners=True)
        out = img_features

        for f in self.features:
            y = f(x)
            out = torch.cat((out, y), 1)
        return out


class Dataset3D(data.Dataset):
    def __init__(self, size=(32, 144, 144, 144)):
        # 16 random input volumes, each with 32 channels and spatial size 144^3
        self.data = np.random.randn(16, 32, 144, 144, 144).astype(np.float32)
    def __getitem__(self, index):
        return self.data[index]
    def __len__(self):
        return 16

# Training dataset
training_dataset = Dataset3D()

trainloader = data.DataLoader(
    training_dataset,
    batch_size=1,
    shuffle=True)

net = Net().cuda(0)
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

num_epochs = 10
for i in range(num_epochs):
    net.train()
    for j, inputs in enumerate(trainloader):

        inputs = inputs.to('cuda:0')

        optimizer.zero_grad()

        preds = net(inputs)  # the crash happens here, in the forward pass

        print("Epoch {}\n".format(i + 1))

If I set dilation > 1 and torch.backends.cudnn.deterministic = True, it crashes.
On my Tesla V100-SXM2-32GB it raises the illegal-memory-access error from above; on my RTX 2080 Ti it raises an OOM error instead:

RuntimeError: CUDA out of memory. Tried to allocate 9.61 GiB (GPU 0; 10.73 GiB total capacity; 1.42 GiB already allocated; 8.50 GiB free; 837.00 KiB cached)

But if I set the deterministic flag to False, OR set dilation = 1, it works on both machines.
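
In short, with the dilated 3x3 conv3d branches the combinations behave like this on my machines (just a summary of the snippet above):

import torch

torch.backends.cudnn.deterministic = True   # dilation > 1 -> illegal memory access (V100) / OOM (2080 Ti)
torch.backends.cudnn.deterministic = False  # dilation > 1 -> runs fine
# with dilation = 1 in all branches, both flag settings run fine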

I only understand the deterministic flag at face value and don't know what is going on inside the cuDNN algorithm selection.
Is this a bug, or did I misuse the flag?

Thanks for the code snippet! I’ll try to reproduce it.
It seems to be a bug and you didn’t misuse the deterministic flag.