Output of single convolutional layer differs for same input processed individually vs in a batch

I am working on a project where I am using a VGG to process images. I am passing two inputs at once in a batch to the VGG, and the output for each input in the batch differs from the corresponding output when that input is processed individually.

After checking the output of each layer, I found that the outputs start to differ at a conv2d layer, features[7].

I have checked the implementation of the convolutional layer, and I am confident that the forward pass is being performed correctly.

I suspect that there may be some other issue that I am overlooking, and I would appreciate any insights or suggestions on how to resolve this issue.

Here is my test code:

import torch
import torchvision

model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).eval().cuda()
x = torch.rand(2, 3, 224, 224).cuda()
x1 = x[None, 1]  # the second sample, kept as a batch of size 1 (same as x[1:2])
# per-layer absolute difference between the batched and the single-sample forward pass
diffs = [(model.features[:i + 1](x)[1] - model.features[:i + 1](x1)).abs().sum().item() for i in range(30)]
print(diffs)

It shows:

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0730438232421875, 1.6718090772628784, 0.8785703182220459, 41.91109085083008, 16.41341781616211, 152.1153106689453, 42.8715705871582, 290.87322998046875, 30.523223876953125, 15.055337905883789, 134.59072875976562, 29.589153289794922, 140.07200622558594, 28.335479736328125, 140.5168914794922, 8.338151931762695, 3.884805917739868, 20.979028701782227, 4.820128440856934, 20.844335556030273, 3.523458480834961, 17.630252838134766, 0.9822019934654236]
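
For reference, this is roughly the kind of isolated check I mean (a minimal sketch: feat is the activation entering features[7], and only that conv layer is compared batched vs. single-sample):

import torch
import torchvision

model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).eval().cuda()
x = torch.rand(2, 3, 224, 224).cuda()
with torch.no_grad():
    feat = model.features[:7](x)   # activations entering features[7]
    conv = model.features[7]       # the first layer whose outputs diverge
    batched = conv(feat)[1]        # second sample, processed as part of the batch
    single = conv(feat[1:2])[0]    # second sample, processed on its own
print((batched - single).abs().max().item())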

I found that this only happens on CUDA. If the model is on the CPU, there is no problem. Why?

I cannot reproduce such large numerical mismatches; running your code I see:

[0.0, 0.0, 0.0020265579223632812, 0.0020227432250976562, 0.0020227432250976562, 0.005452632904052734, 0.0043070316314697266, 0.011111259460449219, 0.011111259460449219, 0.011111259460449219, 0.01072549819946289, 0.01072549819946289, 0.0134124755859375, 0.0134124755859375, 0.01974201202392578, 0.016467034816741943, 0.016467034816741943, 0.01416158676147461, 0.01416158676147461, 0.01039886474609375, 0.01039886474609375, 0.010308146476745605, 0.008580684661865234, 0.008580684661865234, 0.0046024322509765625, 0.0046024322509765625, 0.0031092166900634766, 0.0031092166900634766, 0.0030242204666137695, 0.0026617050170898438]

on a 3090 using torch==2.0.0+cu118.
Small mismatches like these are expected: the 3090 is an Ampere GPU, so TF32 is used by default for cuDNN convolutions (which reduces the mantissa precision), and different batch sizes can also make cuDNN pick different algorithms. Disabling TF32 for cuDNN operations lowers the mismatches:

torch.backends.cudnn.allow_tf32 = False

which then shows:

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.814697265625e-05, 2.09808349609375e-05, 4.76837158203125e-05, 2.288818359375e-05, 2.288818359375e-05, 8.58306884765625e-06, 7.62939453125e-06, 1.0013580322265625e-05, 6.198883056640625e-06, 6.67572021484375e-06, 3.1441450119018555e-06]
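
Note that the flag has to be set before the forward passes run. A self-contained version of the comparison with TF32 disabled could look like this (your test code, just with the flag set first):

import torch
import torchvision

torch.backends.cudnn.allow_tf32 = False  # use full float32 precision for cuDNN convolutions

model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).eval().cuda()
x = torch.rand(2, 3, 224, 224).cuda()
x1 = x[1:2]  # second sample as a batch of size 1
with torch.no_grad():
    diffs = [(model.features[:i + 1](x)[1] - model.features[:i + 1](x1)).abs().sum().item() for i in range(30)]
print(diffs)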

and using float64 lowers them even further.
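
A float64 check could look like this (a sketch; everything is simply cast to double, which is slow on consumer GPUs but useful as a numerical reference):

import torch
import torchvision

model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).eval().double().cuda()
x = torch.rand(2, 3, 224, 224, dtype=torch.float64).cuda()
x1 = x[1:2]
with torch.no_grad():
    diffs = [(model.features[:i + 1](x)[1] - model.features[:i + 1](x1)).abs().sum().item() for i in range(30)]
print(diffs)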

I don’t know which setup you are using, as you haven’t shared any details, but I would recommend updating to the latest stable or nightly release.