Difference between PyTorch BN and Caffe2 BN

Hello, I’ve converted the ResNet from torchvision for use in Caffe2, but the Caffe2 model gives different outputs. The conv layers seem to give the same results, but the batch norm layers do not.

It’s been hard to try and trace the differences, so I was hoping to get some help. I realize the frameworks may do floating-point math in different order and get errors propagated, etc., but I am hoping for a simple fix.

One thing I’ve tried is inverting the ‘running_variance’ from the PyTorch BN when converting it to the ‘_riv’ variable, and copying it to the ‘_siv’ variable, in the Caffe2 BN. It’s unclear whether Caffe2 stores and processes inverse variance. (The code seems to suggest so, but the docs only mention “running/saved_var”.) Anyway, it didn’t work.
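
To make the variance vs. inverse variance question concrete, here is a minimal sketch of inference-time batch norm for a single channel in plain Python. It does not call either framework. The helper names are made up for illustration, and eps = 1e-5 is an assumption (it is PyTorch's default for BatchNorm; the epsilon on the Caffe2 op would need to be checked separately). It shows that if a blob is expected to hold an inverse, the matching quantity would be 1/sqrt(var + eps) rather than 1/var, so a plain inversion of the variance gives different outputs.

```python
import math

EPS = 1e-5  # assumed: PyTorch's BatchNorm default; verify the Caffe2 op's epsilon


def bn_from_var(x, mean, var, weight, bias, eps=EPS):
    """Inference BN for one channel, given the running variance."""
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * weight + bias for v in x]


def bn_from_inv_std(x, mean, inv_std, weight, bias):
    """Same computation, given a precomputed 1/sqrt(var + eps)."""
    return [(v - mean) * inv_std * weight + bias for v in x]


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two lists."""
    return max(abs(p - q) for p, q in zip(a, b))


# Toy values for one channel (illustrative only, not from a real model).
x = [0.5, -1.0, 2.0, 3.5]
mean, var, weight, bias = 1.0, 4.0, 1.5, 0.1

reference = bn_from_var(x, mean, var, weight, bias)

# Conversion to inverse standard deviation, with eps inside the sqrt.
converted = bn_from_inv_std(x, mean, 1.0 / math.sqrt(var + EPS), weight, bias)

# Plain inversion of the variance (1/var): a different quantity.
naive = bn_from_inv_std(x, mean, 1.0 / var, weight, bias)

print("inverse std vs reference:", max_abs_diff(reference, converted))
print("1/var vs reference:      ", max_abs_diff(reference, naive))
```

Feeding the same input through one BN layer in each framework and comparing against these two variants should show which convention each blob follows, and whether the layer is running with stored statistics at all rather than batch statistics.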

ONNX is a no-go since it doesn’t seem to support feature concatenation.