Conv3d bug on P100 GPU

Hi,

When I feed an input with batch size N > 65535 into a Conv3d layer, the output for batch indices above 65535 is clearly incorrect. This happens only on a P100 GPU, not on the CPU or on other GPUs. Judging from my earlier tests it may also happen on a P4 GPU, but I don’t have one at hand right now to test again. It affects only Conv3d, not Conv1d or Conv2d.

Code (run in Colab):

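import torch
from torch import nn

# 1x1x1 convolution with one input and one output channel, no bias
net = nn.Conv3d(1, 1, 1, bias=False)
# Batch size 70000 exceeds 65535
x = torch.rand(70000, 1, 2, 2, 2)

# Reference result computed on the CPU, then moved to the GPU for comparison
out_cpu = net(x).cuda()

# Same layer and same input evaluated on the GPU
net.cuda()
out_gpu = net(x.cuda())

# Squared error per sample, summed over all non-batch dimensions
error = torch.sum((out_cpu - out_gpu).detach()**2, dim=(1, 2, 3, 4))
print(error[65500:65600])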

The output on the P100 is:

tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.2917, 0.4172, 0.7399, 0.4553, 0.3345, 0.8831, 0.3979, 0.5914, 0.5503,
        0.5707, 0.5280, 0.7113, 0.5516, 0.8872, 0.6879, 0.4335, 0.7914, 0.4365,
        0.2578, 0.2922, 0.1646, 0.7618, 0.5094, 0.3610, 0.6823, 0.8531, 0.6192,
        0.3508, 0.2554, 0.9788, 0.3178, 0.6107, 0.2074, 1.0488, 0.2410, 0.1997,
        0.8527, 0.5012, 0.3539, 0.6145, 0.4775, 0.5919, 0.7322, 0.6376, 0.5392,
        0.6394, 0.5922, 0.6976, 0.4430, 0.4933, 0.5123, 0.3211, 0.2196, 0.6387,
        0.2673, 0.1693, 0.2910, 0.4832, 0.3100, 0.4031, 0.3633, 0.5821, 0.4544,
        0.2899], device='cuda:0')

I cannot reproduce it with a current master build on a P100. Which PyTorch version are you using?

Thanks for your reply.
The PyTorch version is 1.8.1+cu101.

Thanks for the information! How did you install the CUDA 10.1 build? The install instructions mention CUDA 10.2 or CUDA 11.1.

I didn’t install it manually. It came preinstalled on Colab.

By the way, the output is correct when groups >= 2 in the Conv3d layer.

Thanks for your previous reply. It prompted me to check the CUDA version on Colab, which is 11.2. The problem occurs with PyTorch 1.8.1+cu101 and 1.8.1+cu102, but disappears with 1.8.1+cu111.
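For anyone who wants to check the same thing, here is a minimal sketch of how the build information could be read out. `cuda_from_wheel_tag` is a hypothetical helper, not part of PyTorch. It assumes the `+cuXYZ` suffix encodes the CUDA toolkit as the major digits followed by one minor digit, so `cu101` means 10.1. `torch.__version__`, `torch.version.cuda` and `torch.cuda.get_device_name` are existing PyTorch attributes. Note that `torch.version.cuda` reports the toolkit the wheel was built with, which can differ from the CUDA version the system reports.

```python
import re


def cuda_from_wheel_tag(version):
    """Return the CUDA version encoded in a tag like '1.8.1+cu101', or None.

    Hypothetical helper; assumes the last digit after 'cu' is the minor version.
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:  # lets the helper above run without PyTorch installed
        torch = None

    if torch is not None:
        print("PyTorch version:", torch.__version__)
        print("CUDA from wheel tag:", cuda_from_wheel_tag(torch.__version__))
        print("CUDA toolkit PyTorch was built with:", torch.version.cuda)
        if torch.cuda.is_available():
            print("GPU:", torch.cuda.get_device_name(0))
```

Applied to the versions in this thread, the helper maps `1.8.1+cu101` to 10.1, `1.8.1+cu102` to 10.2 and `1.8.1+cu111` to 11.1.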

Yeah, that was what I wanted to suggest next. I also verified that it doesn’t show any errors with CUDA 11.1.
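If switching to the CUDA 11.1 build is not an option, one possible workaround is to keep each call below the batch size at which the problem was reported. This is only a sketch under that assumption: nobody in this thread tested whether chunking actually avoids the bug. `batch_chunks` and `conv_in_chunks` are hypothetical helper names, and the default limit of 65535 is taken from the report above.

```python
def batch_chunks(total, limit=65535):
    """Yield (start, stop) pairs covering range(total) in pieces of at most `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, total, limit):
        yield start, min(start + limit, total)


def conv_in_chunks(net, x, limit=65535):
    """Apply `net` to slices of `x` along the batch dimension and concatenate.

    Assumption: calls with at most `limit` samples are unaffected by the bug.
    """
    import torch  # imported lazily so batch_chunks works without PyTorch

    if x.shape[0] == 0:
        return net(x)  # nothing to split; torch.cat would reject an empty list
    pieces = [net(x[start:stop]) for start, stop in batch_chunks(x.shape[0], limit)]
    return torch.cat(pieces, dim=0)
```

With the 70000-sample input from the first post, this would run the layer on samples 0 to 65534 and then on samples 65535 to 69999, for example `out_gpu = conv_in_chunks(net, batch)` where `batch` is the input tensor already on the GPU.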