I have a simple ResNet implementation, composed of BatchNorm2d, Conv2d, relu, and Linear. I trained it with some data, saved it to disk via jit.trace(), reloaded it into memory via jit.load(), put it in eval mode, and passed it an all-zeros input tensor whose first dimension is batch_size. My expectation is that the first row of the output should not vary significantly as I vary batch_size. However, I instead see very large differences (>1e-3) depending on batch_size.
To reproduce:

```
git clone https://github.com/shindavid/pytorch-issue.git
cd pytorch-issue
python demo.py model.pt
```
Here are the contents of the demo.py script:
"""
python demo.py model.pt
"""
import random
import sys
import numpy as np
import torch
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.set_printoptions(linewidth=200)
torch.use_deterministic_algorithms(True)
filename = sys.argv[1]
print('Testing: ' + filename)
net = torch.jit.load(filename)
net.to('cuda')
net.eval()
torch.set_grad_enabled(False)
def get_output(batch_size):
input_tensor = torch.zeros((batch_size, 2, 7, 6)).to('cuda', non_blocking=True)
output_tuple = net(input_tensor)
output_tensor = output_tuple[0]
return output_tensor[:1].to('cpu')
out1 = get_output(1)
failed = False
for b in range(2, 64):
out = get_output(b)
if torch.all(out == out1):
pass
else:
failed = True
print('Batch size {} is NOT OK. Diffs: {}'.format(b, out - out1))
if not failed:
print('All ok!')
The output includes lines like this, demonstrating differences that exceed 1e-3:

```
Batch size 63 is NOT OK. Diffs: tensor([[ 0.0014, -0.0002,  0.0013, -0.0009,  0.0013,  0.0002,  0.0004]])
```
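As an aside on the check itself: demo.py uses exact equality, so it flags any nonzero drift; since the concern here is specifically diffs above 1e-3, a tolerance-based comparison makes that threshold explicit. A small sketch using torch.allclose and a max-abs-diff helper (the tensors below are made-up stand-ins for get_output() results, not values from the repo's model):

```python
import torch

def max_abs_diff(a, b):
    # Largest elementwise absolute difference between two tensors.
    return (a - b).abs().max().item()

# Hypothetical stand-ins for two get_output() results:
out1 = torch.tensor([[0.10, 0.20, 0.30]])
outb = torch.tensor([[0.10, 0.20, 0.31]])

print(max_abs_diff(out1, outb))               # ~0.01
print(torch.allclose(out1, outb, atol=1e-3))  # False: diff exceeds the tolerance
```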
I can provide more details about the model architecture if needed, though anyone can clone the repo above and inspect the model directly. One observation I made is that if I train an alternative model with all nn.BatchNorm2d layers removed, I no longer observe this batch-size-dependent output behavior, hence the title of this post.
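That observation is consistent with how BatchNorm2d is supposed to behave in eval mode: it normalizes with its stored running statistics, so each sample should be transformed independently of the rest of the batch. A minimal CPU sketch (a toy standalone layer, not the repo's model) illustrating the expected batch-size independence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer with fake "learned" running statistics.
bn = nn.BatchNorm2d(2)
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 1.5)
bn.eval()  # eval mode: normalize with running stats, not batch stats

with torch.no_grad():
    out1 = bn(torch.zeros(1, 2, 7, 6))
    out8 = bn(torch.zeros(8, 2, 7, 6))

# On CPU the first rows match exactly, regardless of batch size.
print(torch.equal(out1[0], out8[0]))
```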
Other observations:
- If I keep all values on CPU, the differences become smaller (from 1e-3 to ~1e-7).
- Loading the same model in an equivalent c++ program leads to the same output values.
- The value of net.parameters() appears to never change as a result of any get_output() call. Specifically, the signature of net never changes, where the signature is defined by:

```python
def signature(net):
    return tuple(tuple(map(float, p.flatten())) for p in net.parameters())
```
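One caveat about this check: net.parameters() does not include buffers, and BatchNorm's running_mean, running_var, and num_batches_tracked are registered as buffers, so the signature above would miss them even if they changed. A sketch of an extended signature that also covers buffers (illustrated on a toy module, not the repo's model; in eval mode the running stats should stay fixed):

```python
import itertools
import torch
import torch.nn as nn

def full_signature(net):
    # Include buffers (e.g. BatchNorm running_mean / running_var),
    # which net.parameters() does not cover.
    tensors = itertools.chain(net.parameters(), net.buffers())
    return tuple(tuple(map(float, t.detach().flatten())) for t in tensors)

# Toy module for illustration only:
net = nn.Sequential(nn.Conv2d(2, 4, 3, padding=1), nn.BatchNorm2d(4))
net.eval()

sig_before = full_signature(net)
with torch.no_grad():
    net(torch.zeros(1, 2, 7, 6))
sig_after = full_signature(net)
print(sig_before == sig_after)  # True: in eval mode, running stats don't update
```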
I am using pytorch 1.12.1 and CUDA 11.6.
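For what it's worth, torch.use_deterministic_algorithms(True) on CUDA also requires the CUBLAS_WORKSPACE_CONFIG environment variable to be set (before CUDA initializes) for some cuBLAS ops, and cuDNN's autotuner (torch.backends.cudnn.benchmark) can pick different kernels for different input shapes. A hedged sketch of settings worth trying; this is a guess at a mitigation, not a confirmed fix, since even deterministic kernels may legitimately differ across batch sizes:

```python
import os

# Required for deterministic cuBLAS on CUDA >= 10.2; must be set
# before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False      # don't autotune kernels per input shape
torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic kernels
```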
I found some seemingly related posts ([1], [2], [3]), but none seem to explain what I am observing.
This behavior is causing major problems in my research project, so any solutions would be much appreciated!