Precision difference between GPU and CPU

There is a precision difference between convolutions executed on the CPU and on the GPU using nn.Conv2d().

In the worst case, the results of the forward pass on GPU and CPU agree only up to 3 digits. When the number of input channels is greater than one while output channels and groups are both one, the convolution has to sum over input channels, which lowers the precision further: the GPU and CPU results then agree only up to 1 digit.
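
For context, I understand that float32 addition is not associative, so a different accumulation order on GPU vs. CPU can change the low-order digits. Here is a minimal sketch of that effect in plain PyTorch, unrelated to convolution:

import torch

# Floating-point addition is not associative: summing the same three
# float32 values in two different orders gives two different results.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(-1e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

print((a + b) + c)  # tensor(1.)
print(a + (b + c))  # tensor(0.) -- c is absorbed when added to b first

But a mismatch in the 3rd, let alone the 1st, digit seems far larger than what reordering alone should cause.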

What is the reason for that?

That sounds like too big a difference. Are you checking this on a single conv layer?
Could you send a small code sample to reproduce this, please?

Yes, sure. I just checked on a single conv layer (nn.Conv2d) with random hyperparameters.

import random
from random import randint

import torch
import torch.nn as nn

num_iter = 6000
torch.set_printoptions(precision=6)
for i in range(num_iter):

    # random hyperparameters for this trial
    padVal      = round(random.uniform(0, 10), 6)
    padAmount   = randint(1, 5)  # starts from 1
    weights     = round(random.uniform(0, 10), 6)
    dilated     = randint(1, 4)
    size_input  = randint(15, 25)
    size_kernel = randint(1, 5)
    channel          = randint(2, 5)  # unused here; input_channel is fixed to 1 below
    input_channel    = 1
    output_channel   = 1
    group_num        = 1
    stride_num       = randint(1, 5)

    print('Pad value: %6f Pad amount: %s Weights: %6f Size input: %s Size kernel: %s Dilated: %s'
          % (padVal, padAmount, weights, size_input, size_kernel, dilated))

    # random input, padded with a constant value
    non_padded_input = torch.randn(1, input_channel, size_input, size_input)
    padder           = nn.ConstantPad2d(padAmount, padVal)
    padded_input     = padder(non_padded_input)

    # declare two identically configured nets, one on GPU and one on CPU
    net_gpu = nn.Conv2d(input_channel, output_channel, size_kernel,
                        padding=0, stride=stride_num, dilation=dilated,
                        groups=group_num, bias=False).cuda()

    net_cpu = nn.Conv2d(input_channel, output_channel, size_kernel,
                        padding=0, stride=stride_num, dilation=dilated,
                        groups=group_num, bias=False)  # can be outside the loop

    # initialize both weight tensors with the same value
    net_gpu.weight.data.fill_(weights)
    net_cpu.weight.data.fill_(weights)

    # forward pass on each device
    output_gpu = net_gpu(padded_input.cuda())
    output_cpu = net_cpu(padded_input)

    # element-wise comparison against an absolute tolerance of 1e-3
    if torch.all(torch.lt(torch.abs(output_gpu - output_cpu.cuda()), 1e-3)):
        pass
    else:
        print('!!!!!!!!!!!!!!!!! BUG IN CODE !!!!!!!!!!!!!!!!!!!!')
        # print('Output gpu:'); print(output_gpu); print('Output cpu:'); print(output_cpu)
        assert False

print('Test is done, good to go!')

Hi,

I ran your code sample 10 times and it never reported an issue.
Do you have a special setting where it fails for you? It seems to work fine on my install.
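
One thing to double-check on your side: the comparison uses a fixed absolute tolerance of 1e-3, but with padding values and weights up to 10 summed over a kernel of up to 5x5, the outputs can easily reach the hundreds, where an absolute difference of 1e-3 is still an excellent relative match for float32. A relative comparison is usually the fairer test. Here is a minimal sketch using torch.allclose (the rtol/atol values are illustrative choices, not prescriptions):

# Compare with a tolerance that scales with the magnitude of the outputs.
# rtol/atol below are illustrative; tune them to your precision needs.
gpu_result = output_gpu.cpu()
if not torch.allclose(gpu_result, output_cpu, rtol=1e-4, atol=1e-6):
    max_abs = (gpu_result - output_cpu).abs().max().item()
    max_rel = max_abs / output_cpu.abs().max().item()
    print('max abs diff: %g, max rel diff: %g' % (max_abs, max_rel))

If the maximum relative error is on the order of 1e-6 to 1e-7, that is within normal float32 behavior for accumulations performed in different orders on CPU and GPU.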