Bug in a copied convolution module with a 1x1 kernel?

I could not get two models with the same parameters to produce the same result.
It seems that a model containing a convolution module with a 1x1 kernel produces different gradients once it has been copied.

OS: Ubuntu 16.04
PyTorch version: 0.3.1
How you installed PyTorch (conda, pip, source): conda in a Docker container (Ubuntu 16.04)
Python version: 3.6.1
CUDA/cuDNN version: 9.0
GPU models and configuration: K80 (AWS p2 instance)
GCC version (if compiling from source):

import torch
import torch.nn as nn
from torch.autograd import Variable

import numpy as np

class Model(nn.Module):
    def __init__(self, num_classes=1000):
        super(Model, self).__init__()
        self.conv1x1 = nn.Conv2d(3, 64, kernel_size=1, stride=1, padding=0, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x_c = self.conv1x1(x)
        x_pool = self.pool(x_c).view(-1, 64)
        output = self.fc(x_pool)
        return x_c, x_pool, output


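model1 = Model().cuda()
model2 = Model().cuda()
# copy model1's parameters into model2 so that both start from identical weights
model2.load_state_dict(model1.state_dict())

input_np = np.random.random((1, 3, 64, 64))
input = Variable(torch.Tensor(input_np), requires_grad=False).cuda()
criterion1 = nn.CrossEntropyLoss().cuda()
criterion2 = nn.CrossEntropyLoss().cuda()

target = Variable(torch.LongTensor(np.array([0])), requires_grad=False).cuda()

x_c_1, x_p_1, o1 = model1(input)
x_c_2, x_p_2, o2 = model2(input)
loss1 = criterion1(o1, target)
loss2 = criterion2(o2, target)

# backpropagate so that .grad is populated on every parameter before comparing
loss1.backward()
loss2.backward()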

param1 = list(model1.parameters())
param2 = list(model2.parameters())
for p1, p2 in zip(param1, param2):
    # fail when parameter is in conv layer
    print("params have same gradient?: ", p1.data.numel() == (p1.grad.data == p2.grad.data).nonzero().size(0))

When the height and width of the input are small (<= 32x32), the gradients of the copied model are the same as those of the original model.
Also, convolution modules with a 3x3 kernel did not cause the problem, whatever the input size.
Am I missing something? Thanks!

My guess is that under some conditions the GPU makes slight approximations that can differ from one run to another.
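
As a minimal illustration of how such tiny differences can arise, here is a pure-Python sketch (not PyTorch or cuDNN code; the numbers are picked only for illustration). Floating-point addition is not associative, so adding the same values in a different order can change the last bits of the result. Whether this is what actually happens in the backward pass of the 1x1 convolution is an assumption, not something verified here.

```python
import math

# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# An exact comparison fails even though both sums are mathematically equal.
exactly_equal = left_to_right == right_to_left

# A comparison with a small absolute tolerance succeeds.
close_enough = math.isclose(left_to_right, right_to_left,
                            rel_tol=0.0, abs_tol=1e-8)

print("exactly equal:", exactly_equal)
print("equal within 1e-8:", close_enough)
```

An exact `==` on gradients is sensitive to this kind of rounding, which is why a tolerance-based comparison is the safer check.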

Try checking whether all the gradient elements agree to within some small tolerance, instead of requiring exact equality.

epsilon = 1e-8
# use this in place of the exact comparison inside the loop over (p1, p2)
print("params have same gradient?: ",
    p1.data.numel() == ((p1.grad.data - p2.grad.data).abs() <= epsilon).nonzero().size(0))
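
It can also help to print how large the differences actually are, rather than only whether they stay within the tolerance. Below is a minimal sketch, assuming a recent PyTorch release (`.item()` and `torch.allclose` may not be available on 0.3.1). The names `max_abs_diff`, `grad_a`, and `grad_b` are hypothetical; the two tensors are made-up stand-ins for `p1.grad` and `p2.grad`.

```python
import torch


def max_abs_diff(a, b):
    """Return the largest element-wise absolute difference as a Python float."""
    return (a - b).abs().max().item()


# Hypothetical stand-ins for two gradients that differ by a tiny amount.
# float64 is used so that the small perturbation is not rounded away.
tiny = 2.0 ** -30  # roughly 9.3e-10
grad_a = torch.tensor([0.5, -1.25, 2.0], dtype=torch.float64)
grad_b = grad_a + torch.tensor([0.0, tiny, 0.0], dtype=torch.float64)

diff = max_abs_diff(grad_a, grad_b)
exact_match = torch.equal(grad_a, grad_b)
within_tol = torch.allclose(grad_a, grad_b, rtol=0.0, atol=1e-8)

print("max abs difference:", diff)
print("exactly equal:", exact_match)
print("equal within 1e-8:", within_tol)
```

For real gradients, the same helper could be called as `max_abs_diff(p1.grad, p2.grad)` inside the loop over the parameters.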

Note that strictly speaking the models do not share parameters. They are just identically initialised. When one model is trained, the parameters of the other model will not change.
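
The difference between copying and sharing can be sketched as follows. This is a minimal CPU-only example, assuming a recent PyTorch release; `layer_a`, `layer_b`, and `layer_c` are hypothetical names, and the in-place `add_` merely stands in for an optimizer step.

```python
import torch
import torch.nn as nn

layer_a = nn.Linear(4, 2)
layer_b = nn.Linear(4, 2)

# Copying: layer_b gets the same values but keeps its own separate tensors.
layer_b.load_state_dict(layer_a.state_dict())
same_before_update = torch.equal(layer_a.weight, layer_b.weight)

# Modify layer_a in place, as an optimizer step would.
with torch.no_grad():
    layer_a.weight.add_(1.0)
same_after_update = torch.equal(layer_a.weight, layer_b.weight)

# Sharing: both modules hold the very same Parameter object.
layer_c = nn.Linear(4, 2)
layer_c.weight = layer_a.weight
with torch.no_grad():
    layer_a.weight.add_(1.0)
shared_still_equal = torch.equal(layer_a.weight, layer_c.weight)

print("copy equal before update:", same_before_update)
print("copy equal after update:", same_after_update)
print("shared equal after update:", shared_still_equal)
```

The copied layer stops matching as soon as the original is updated, while the layer that shares the `Parameter` object follows every change.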

The maximum element-wise difference between the gradients was around 1e-8, so this does not seem to be a big problem in practice.
Thank you for your reply.