Bug of Conv3d on P100 GPU

AoChen · April 30, 2021, 8:22am

Hi,

When I feed an input with batch size N > 65535 into a Conv3d layer, the output after index 65535 is clearly incorrect. This happens only on a P100 GPU, not on the CPU or on other GPUs. According to my earlier tests it may also happen on a P4 GPU, but I don’t have one on hand to test again. It happens only with Conv3d, not with Conv1d or Conv2d.

Code (run in Colab):

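import torch
from torch import nn

# 1x1x1 convolution with a single channel and no bias
net = nn.Conv3d(1, 1, 1, bias=False)
# Batch size 70000 exceeds 65535
input = torch.rand(70000, 1, 2, 2, 2)
# Reference result computed on the CPU, then moved to the GPU for comparison
out_cpu = net(input).cuda()

# Same layer and input, computed on the GPU
net.cuda()
out_gpu = net(input.cuda())

# Squared error per sample, summed over all non-batch dimensions
error = torch.sum((out_cpu - out_gpu).detach()**2, dim=(1,2,3,4))
print(error[65500:65600])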

The output on P100 is

tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.2917, 0.4172, 0.7399, 0.4553, 0.3345, 0.8831, 0.3979, 0.5914, 0.5503,
        0.5707, 0.5280, 0.7113, 0.5516, 0.8872, 0.6879, 0.4335, 0.7914, 0.4365,
        0.2578, 0.2922, 0.1646, 0.7618, 0.5094, 0.3610, 0.6823, 0.8531, 0.6192,
        0.3508, 0.2554, 0.9788, 0.3178, 0.6107, 0.2074, 1.0488, 0.2410, 0.1997,
        0.8527, 0.5012, 0.3539, 0.6145, 0.4775, 0.5919, 0.7322, 0.6376, 0.5392,
        0.6394, 0.5922, 0.6976, 0.4430, 0.4933, 0.5123, 0.3211, 0.2196, 0.6387,
        0.2673, 0.1693, 0.2910, 0.4832, 0.3100, 0.4031, 0.3633, 0.5821, 0.4544,
        0.2899], device='cuda:0')
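In the printed slice, 36 zeros precede the first nonzero value, so the first wrong sample is at index 65500 + 36 = 65536. A minimal sketch for locating that index directly instead of reading it off the printout follows. The helper name `first_bad_index` and the tolerance `tol` are hypothetical, and the `error` tensor here is synthetic stand-in data so that the snippet also runs without a GPU.

```python
import torch

def first_bad_index(error, tol=1e-6):
    """Return the first batch index whose error exceeds tol, or None.

    `first_bad_index` and `tol` are illustrative names, not part of PyTorch.
    """
    bad = torch.nonzero(error > tol).flatten()
    return int(bad[0]) if bad.numel() > 0 else None

# Synthetic stand-in for the per-sample error computed by the script above:
# correct up to index 65535, wrong from index 65536 onward.
error = torch.zeros(70000)
error[65536:] = 0.5
print(first_bad_index(error))
```

With the real `error` tensor from the script, the same call would report where the CPU and GPU outputs start to diverge.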

ptrblck · April 30, 2021, 8:27am

I cannot reproduce it with a current master build on a P100. Which PyTorch version are you using?

AoChen · April 30, 2021, 8:37am

Thanks for your reply.
The PyTorch version is 1.8.1+cu101.

ptrblck · April 30, 2021, 8:39am

Thanks for the information! How did you install the CUDA 10.1 version, given that the install instructions mention CUDA 10.2 or CUDA 11.1?

AoChen · April 30, 2021, 8:41am

I didn’t install it manually. It comes preinstalled on Colab.

AoChen · April 30, 2021, 8:54am

By the way, the output is correct when groups >= 2 in the Conv3d layer.

AoChen · April 30, 2021, 9:37am

Thanks for your previous reply. It inspired me to check the CUDA version in Colab, which is 11.2. The problem occurs with PyTorch 1.8.1+cu101 and 1.8.1+cu102, but disappears with 1.8.1+cu111.
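The CUDA version installed on the machine and the CUDA runtime a PyTorch wheel was built against are separate things: the `+cu101` suffix refers to the latter. A small sketch for printing the details relevant here is shown below. The function name `describe_environment` is hypothetical; the attributes it reads are standard PyTorch ones.

```python
import torch

def describe_environment():
    """Collect the version details discussed in this thread.

    `describe_environment` is an illustrative helper, not a PyTorch API.
    """
    info = {
        "torch": torch.__version__,        # e.g. "1.8.1+cu101"
        "cuda_build": torch.version.cuda,  # CUDA version of the build, None for CPU-only builds
        "gpu": None,                       # filled in only when a GPU is visible
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info

print(describe_environment())
```

Including this output in a bug report makes it easier for others to match the exact build and GPU.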

ptrblck · April 30, 2021, 9:47am

Yeah, that was what I wanted to suggest next, and I also verified that it’s not showing any errors with CUDA 11.1.
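The fix confirmed in this thread is switching to the cu111 build. For anyone who cannot change builds, one possible workaround is to keep each call at or below the 65535 batch size mentioned in the original report by splitting the input. This is only a sketch: the helper `conv_in_chunks` and its `max_batch` parameter are hypothetical, and nobody in this thread tested whether chunking avoids the problem on the affected builds.

```python
import torch
from torch import nn

def conv_in_chunks(layer, x, max_batch=65535):
    """Apply `layer` to `x` in batch slices of at most `max_batch` samples.

    Illustrative helper; assumes samples in the batch are independent,
    which holds for a plain convolution layer.
    """
    outputs = [layer(chunk) for chunk in torch.split(x, max_batch, dim=0)]
    return torch.cat(outputs, dim=0)

net = nn.Conv3d(1, 1, 1, bias=False)
x = torch.rand(70000, 1, 2, 2, 2)

with torch.no_grad():
    chunked = conv_in_chunks(net, x)  # two calls: 65535 and 4465 samples
    full = net(x)                     # single call, for comparison

print(chunked.shape)
```

On a build without the bug, `chunked` and `full` should agree, so comparing them is also a quick way to check whether a given setup is affected.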