SpatialConvolution / Conv2d: different results using PyTorch and torch7 for float tensors

Hi,
I tried to use convolution in PyTorch and torch7 (Lua).
The same operation on identical float tensors produces different results.

Python code:

import torch
import torch.nn as nn 

torch.set_default_tensor_type('torch.FloatTensor')

def test_conv():
    layer = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=True)
    layer.weight.data.fill_(2.2)
    layer.bias.data.fill_(1.2)  

    tensor = torch.zeros((1, 64, 256, 256))
    tensor.fill_(1.3)
    # print(tensor)

    result = layer(tensor)
    print("[test_conv] result: shape={ %s }, type='%s'\n" % (result.shape, result.type()) )

    result_flatten = result.flatten()

    # print the first 30 output values
    for i, n in enumerate(result_flatten[:30], 1):
        print("[%d] %f" % (i, n.item()))

test_conv()

Result:

 [1] 733.357605
 [2] 1099.435791
 [3] 1099.435791
...
 [30] 1099.435791

Torch7 lua code:

require 'nn'

torch.setdefaulttensortype('torch.FloatTensor')

function test_conv()
    local kernel_size = 3
    local stride = 1
    local padding = 1
    local layer = nn.SpatialConvolutionMM(64, 64, kernel_size, kernel_size, stride, stride, padding, padding)
    layer.weight:fill(2.2) -- fill weights with 2.2
    layer.bias:fill(1.2)   -- fill bias with 1.2

    local tensor = torch.Tensor(1, 64, 256, 256)
    tensor:fill(1.3)       -- fill tensor with 1.3
    -- print(tensor)

    local result = layer(tensor)

    print(string.format("result: shape={ %s }, type='%s'\n", tostring(result:size()), result:type()) )

    local result_flatten = result:view(result:nElement())
    for i = 1, 30 do
        print(string.format("[%d] %f", i, result_flatten[i]))
    end
end 

test_conv()

Result:

[1] 733.360107
[2] 1099.439697
[3] 1099.439697
...
[30] 1099.437012

Difference between results:

       PyTorch            torch7
[1]    733.357605         733.360107
[2]    1099.435791        1099.439697
[3]    1099.435791        1099.439697
...
[30]   1099.435791        1099.437012
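
For reference, a quick double-precision check (a minimal sketch reusing the fill values above; the corner and edge values do not depend on the spatial size, so a small input keeps it cheap) shows the exact value that both float32 results are approximating:

import torch
import torch.nn as nn

# Same layer configuration as above, but in float64.
layer = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=True).double()
layer.weight.data.fill_(2.2)
layer.bias.data.fill_(1.2)

x = torch.full((1, 64, 8, 8), 1.3, dtype=torch.float64)
out = layer(x).flatten()
print(out[0].item())  # corner: 64*4*(2.2*1.3) + 1.2 = 733.36
print(out[1].item())  # edge:   64*6*(2.2*1.3) + 1.2 = 1099.44

Both float32 results agree with these values to within a few parts per million.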

Could this be caused by different floating-point arithmetic in PyTorch and torch7?
Or should I use a different convolution operator?

As you suspect, differences like this are within numerical accuracy and thus would be expected between different implementations of the same operation (this is ~4e-3 on a number of size ~1.1e3, so a relative error of ~4e-6, which is not unusual for float32).
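
For example, plugging the numbers from the two runs above into a quick check:

import torch

a, b = 1099.435791, 1099.439697  # PyTorch value vs. torch7 value
print("relative error: %.1e" % (abs(a - b) / abs(b)))           # ~3.6e-06
print("float32 eps:    %.1e" % torch.finfo(torch.float32).eps)  # ~1.2e-07

# A 3x3 conv over 64 input channels accumulates 64*9 = 576 products per
# output, so a drift of a few dozen ulps between implementations is expected.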

Best regards

Thomas

Thanks for your reply. I’m trying to use a PyTorch pre-trained model in torch7 (Lua) and C (libTNN). It seems these small differences can accumulate into large errors in the output of the entire network (which contains many convolution layers).
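
To illustrate the accumulation (a minimal sketch with hypothetical layer sizes, not the actual model), one can run a stack of conv layers in float32 and compare against a float64 copy of the identical weights:

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
# 20 small conv layers as a stand-in for a deeper network.
net32 = nn.Sequential(*[nn.Conv2d(16, 16, 3, padding=1) for _ in range(20)])
net64 = copy.deepcopy(net32).double()  # identical weights, double precision

y32 = torch.randn(1, 16, 32, 32)
y64 = y32.double()
with torch.no_grad():
    for i, (l32, l64) in enumerate(zip(net32, net64), 1):
        y32, y64 = l32(y32), l64(y64)
        rel = ((y32.double() - y64).abs().max() / y64.abs().max()).item()
        print("after layer %2d: max relative difference %.1e" % (i, rel))

The per-layer relative difference grows with depth, which is consistent with small per-layer differences compounding through a deep network.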

Yeah, unfortunate as it is, something isn’t terribly robust about the model, because errors of that size can happen even when switching backends within PyTorch.
You could try fine-tuning the PyTorch model for a few steps so that it gives answers that more closely match the torch7 ones.
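
A rough sketch of that idea (the perturbed reference layer below is only a stand-in; in practice you would record the torch7 outputs on some inputs and load them as the targets):

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=True)

# Stand-in for the torch7 reference: the same layer with slightly perturbed
# weights. Replace `target` with recorded torch7 outputs in practice.
reference = copy.deepcopy(layer)
with torch.no_grad():
    for p in reference.parameters():
        p.add_(1e-4 * torch.randn_like(p))

inp = torch.randn(8, 64, 32, 32)
target = reference(inp).detach()

opt = torch.optim.Adam(layer.parameters(), lr=1e-5)
for step in range(10):
    opt.zero_grad()
    loss = F.mse_loss(layer(inp), target)
    loss.backward()
    opt.step()
    print("step %2d: loss %.3e" % (step, loss.item()))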