Discrepancy between Conv3D and my custom implementation

Hi all,

I am trying to implement Conv3D with the unfold function. However, when I compare the results with the built-in Conv3D, they agree on CPU but not on CUDA. Is there a reason for this?
You can reproduce it with the following code:

import torch
import torch.nn as nn
import torch.nn.functional as F

channels = 5
h, w, d = 4, 4, 4

def test(device):
    image = torch.randn(channels, h, w, d).to(device) # input image

    kh, kw, kd = 3, 3, 3 # kernel size
    dh, dw, dd = 1, 1, 1 # stride

    # Create conv
    conv = nn.Conv3d(channels, 10, (kh, kw, kd), padding='same', bias=False).to(device)
    filt = conv.weight

    # Manual approach
    patches = F.pad(image, (1,)*6)
    patches = patches.unfold(1, kh, dh).unfold(2, kw, dw).unfold(3, kd, dd)

    patches = patches.contiguous().view(channels, -1, kh, kw, kd)

    nb_windows = patches.size(1)

    # Now we have to shift the windows into the batch dimension.
    # Maybe there is another way without .permute, but this should work
    patches = patches.permute(1, 0, 2, 3, 4)

    # Calculate the conv operation manually
    res = patches.flatten(1) @ filt.flatten(1).transpose(0, 1)
    res = res.transpose(0, 1) # out_channels, output_pixels

    res = res.unflatten(1, (h, w, d))

    # Module approach
    out = conv(image)
    print('max abs error ', (out - res).abs().max())

print('Test on CPU')
test(torch.device("cpu")) # 4.7684e-07
print('Test on CUDA')
test(torch.device("cuda")) # 0.0005

If you are using an Ampere GPU or newer, disable TF32 via torch.backends.cudnn.allow_tf32 = False and the error should be smaller.
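As a minimal sketch of that suggestion, the flags can be set once before calling test(). The second flag is an addition that the reply does not mention: it controls TF32 for CUDA matrix multiplications, which is what the `@` in the manual path uses. As far as I know, in recent PyTorch releases it defaults to False while the cuDNN flag defaults to True, which would explain why only the nn.Conv3d side is affected.

```python
import torch

# TF32 for convolutions that go through cuDNN (nn.Conv3d on CUDA).
torch.backends.cudnn.allow_tf32 = False

# TF32 for CUDA matrix multiplications (the `@` in the manual approach).
# Not mentioned in the reply above; set here only so both paths run in full FP32.
torch.backends.cuda.matmul.allow_tf32 = False
```

With both flags off, rerunning test(torch.device("cuda")) should report an error much closer to the CPU figure, though the exact value depends on the GPU and library versions.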


Does that mean my implementation will lose accuracy if we have `torch.backends.cudnn.allow_tf32 = True`?

Yes, precision will be lost, as only 10 bits of mantissa are stored (FP32 keeps 23). That is the trade-off for using Tensor Cores, as described here. The link also points to more detailed blog posts about TF32.
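To get a feel for what 10 mantissa bits mean, here is a rough pure-Python sketch that rounds a float32 value to 10 mantissa bits. The helper name round_to_tf32 is made up for this illustration, and the round-to-nearest step is an assumption; the rounding the hardware actually applies may differ, and special values (inf, NaN) are not handled.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Hypothetical helper: round x to float32, then keep 10 of its 23 mantissa bits."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range (round to nearest, an assumption),
    # then clear the 13 low mantissa bits that a 10-bit mantissa cannot hold.
    bits = (bits + (1 << 12)) & ~((1 << 13) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0 / 3.0
y = round_to_tf32(x)
rel_err = abs(x - y) / x
print(y)        # 0.333251953125
print(rel_err)  # about 2.4e-4, below the 2**-11 bound for round to nearest
```

Each rounded input can be off by up to roughly 5e-4 in relative terms, so an absolute error around 0.0005 in the convolution output is about what one would expect, whereas FP32 rounding errors are near 1e-7.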