Different conv2d results in PyTorch and TensorFlow

Hi,

I am trying to train a CNN model using TensorFlow and PyTorch, but I am getting different results. I have initialized both frameworks with the same weights. Attached are the models and results from both platforms:

PyTorch Model

import torch as th
import torch.nn as nn

class FemnistNet(nn.Module):
    def __init__(self):
        super(FemnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
        self.pool1 = nn.MaxPool2d(2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2)
        self.pool2 = nn.MaxPool2d(2, stride=2)
        self.fc1 = nn.Linear(3136, 2048)  # 64 * 7 * 7 = 3136 after two poolings
        self.fc2 = nn.Linear(2048, 62)

    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        x = self.conv1(x)
        x = th.nn.functional.relu(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = th.nn.functional.relu(x)
        x = self.pool2(x)

        x = x.flatten(start_dim=1)

        x = self.fc1(x)
        l1_activations = th.nn.functional.relu(x)

        x = self.fc2(l1_activations)
        x = x.softmax(dim=1)  # Tensor.softmax needs an explicit dim

        return x, l1_activations

TensorFlow Model

features = tf.placeholder(
    tf.float32, shape=[None, IMAGE_SIZE * IMAGE_SIZE], name='features')
labels = tf.placeholder(tf.int64, shape=[None], name='labels')
input_layer = tf.reshape(features, [-1, IMAGE_SIZE, IMAGE_SIZE, 1])
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu,
    name="conv1")
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=64,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu,
    name="conv_last")
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
dense = tf.layers.dense(inputs=pool2_flat, units=2048, name='dense1')
act_1 = tf.nn.relu(dense)
logits = tf.layers.dense(inputs=act_1, units=self.num_classes)
predictions = {
    "classes": tf.argmax(input=logits, axis=1),
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
}

# Manual cross entropy (tf.losses.sparse_softmax_cross_entropy would be
# the built-in alternative): one-hot targets times log probabilities
values = tf.one_hot(labels, 62) * tf.log(predictions["probabilities"] + 1e-7)
reduced_values = tf.reduce_sum(values)
batch_size = tf.shape(labels)[0]
loss = -reduced_values / tf.cast(batch_size, tf.float32)

# TODO: Confirm that the optimizer initialized once is ok?
train_op = self.optimizer.minimize(
    loss=loss,
    global_step=tf.train.get_global_step())

eval_metric_ops = tf.count_nonzero(
    tf.equal(labels, tf.argmax(predictions["probabilities"], axis=1)))

Can anyone please explain the difference? Is there a difference in the padding implementation between the two frameworks, and could that cause these results? Thanks.

I would recommend creating a single conv layer (or any other layer with parameters) in both frameworks, loading the weights from TF into PyTorch, and verifying that the results are equal for the same input. Once this works, you could test larger blocks until you narrow down where the difference originates.
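For example, a minimal forward-parity check could look like the sketch below. The arrays standing in for the exported TF values are hypothetical; in practice they would come from the session, e.g. via sess.run on the conv1/kernel and conv1/bias variables:

import numpy as np
import torch
import torch.nn as nn

# Stand-ins for the values exported from the TF session
tf_kernel = np.random.randn(5, 5, 1, 32).astype(np.float32)  # TF layout: (H, W, in, out)
tf_bias = np.random.randn(32).astype(np.float32)
x_np = np.random.randn(1, 28, 28, 1).astype(np.float32)      # TF layout: NHWC

# With stride 1 and a 5x5 kernel, TF "same" pads 2 on each side, so padding=2 matches
conv = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
with torch.no_grad():
    # TF stores conv kernels as (H, W, in, out); PyTorch expects (out, in, H, W)
    conv.weight.copy_(torch.from_numpy(tf_kernel).permute(3, 2, 0, 1))
    conv.bias.copy_(torch.from_numpy(tf_bias))

x = torch.from_numpy(x_np).permute(0, 3, 1, 2)   # NHWC -> NCHW
out_nhwc = conv(x).permute(0, 2, 3, 1).detach().numpy()

# In practice, compare against the output fetched from the TF graph:
# print(np.allclose(out_nhwc, tf_out, atol=1e-6))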

Thanks @ptrblck for your thoughts. I am using the same weights, i.e., I loaded the TensorFlow weights into PyTorch. I have set everything else up identically, so the last thing I am considering is padding. What I have read is that TensorFlow's "same" padding can be asymmetric, whereas PyTorch always pads symmetrically. PyTorch added padding="same" in version 1.9.0, but I am using 1.4.0. Could this be the issue?
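To illustrate the asymmetry, here is a standalone sketch (not my training code; the sizes are chosen so that "same" needs an odd total amount of padding):

import torch
import torch.nn.functional as F

# TF "same" on width 6 with kernel 3, stride 2:
# out = ceil(6/2) = 3, total pad = (3-1)*2 + 3 - 6 = 1,
# and TF puts the odd pixel on the right/bottom: pad_left=0, pad_right=1
x = torch.randn(1, 1, 6, 6)
w = torch.randn(1, 1, 3, 3)

x_tf_style = F.pad(x, (0, 1, 0, 1))              # (left, right, top, bottom)
out_tf_style = F.conv2d(x_tf_style, w, stride=2)

# PyTorch's padding argument always pads both sides symmetrically
out_symmetric = F.conv2d(x, w, stride=2, padding=1)

print(out_tf_style.shape, out_symmetric.shape)   # both (1, 1, 3, 3), values differ

That said, my conv layers use stride 1 with a 5x5 kernel, for which "same" pads 2 on every side and is symmetric, so this particular mismatch may not apply here.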

I’m not familiar with TF’s implementation of the padding operation, but based on your description it could certainly be causing the different results. Were you able to check a single conv layer to narrow it down?

Thanks @ptrblck. Please see the attached loss curve for the above model.

[Attached image: combined_pytorch_leaf_testbed_loss]

I am also experimenting with a single Conv2d layer. For that test, should I use the following model, or do you suggest something else?

class FemnistNet(nn.Module):
    def __init__(self):
        super(FemnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=0)
        self.pool1 = nn.MaxPool2d(2, stride=2)
        # 28x28 -> conv (no padding) -> 24x24 -> pool -> 12x12,
        # so the flattened size is 32 * 12 * 12 = 4608
        self.fc1 = nn.Linear(32 * 12 * 12, 512)

    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        x = self.conv1(x)
        x = th.nn.functional.relu(x)
        x = self.pool1(x)

        x = x.flatten(start_dim=1)
        x = self.fc1(x)
        x = x.softmax(dim=1)

        return x

Is the above model correct for a single Conv2d layer? Also, any thoughts on the loss curve? Which components (gradients, loss, or something else) need debugging?

Thanks!

I would start with a single nn.Conv2d, use your workflow to initialize the PyTorch module with the TF parameters, and make sure that the outputs as well as the gradients are identical (up to floating-point precision) before checking larger blocks or the entire model, as that will be easier.

Thanks @ptrblck for your guidance. Please find below the single-layer model in both TF and PyTorch, along with their gradients and logits.

TensorFlow:

Model

features = tf.placeholder(
    tf.float32, shape=[None, IMAGE_SIZE * IMAGE_SIZE], name='features')
labels = tf.placeholder(tf.int64, shape=[None], name='labels')
input_layer = tf.reshape(features, [-1, IMAGE_SIZE, IMAGE_SIZE, 1])
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=1,
    kernel_size=[1, 1],
    padding="valid",
    activation=tf.nn.relu,
    name="conv1")
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
pool1_flat = tf.reshape(pool1, [-1, 1 * 1 * 1])
dense = tf.layers.dense(inputs=pool1_flat, units=1, name='dense1')
act_1 = tf.nn.relu(dense)
logits = tf.layers.dense(inputs=act_1, units=self.num_classes)

Logits

  [[-0.07566161  0.03802636 -0.01639884 -0.1384197   0.00423112 -0.12213263
  -0.08598097 -0.13527007 -0.04319123  0.04253761 -0.00081721  0.01138
   0.12924342  0.02073286  0.1393949  -0.11032829 -0.06213853  0.04869567
   0.09896822 -0.07929925 -0.09605148  0.08670155 -0.1287576   0.00932503
   0.13710512 -0.04368142  0.0532586   0.12914579 -0.03923068 -0.07455046
  -0.05908488  0.11273665  0.01347538  0.08849659  0.03669336  0.04570408
  -0.03158653  0.02188825  0.02023709  0.10528571  0.1394072  -0.03438641
   0.1172283  -0.10719128 -0.03899509 -0.12875672  0.1061636   0.07973649
   0.06920333  0.13832925 -0.1331273  -0.13372292 -0.07513674  0.05274088
   0.01308073  0.09093275 -0.03216879  0.12579407  0.05085118 -0.06270606
   0.13129717  0.00239096]]

Gradients

grade shape (1, 1, 1, 1) [[[[-0.04313534]]]]
grade shape (1,) [-0.04313534]
grade shape (1, 1) [[-0.0778608]]
grade shape (1,) [-0.08592657]
grade shape (1, 62) [[ 0.00673114  0.00754159  0.00714211  0.00632169  0.00729098  0.0064255
   0.00666204  0.00634164  0.00695329  0.00757569  0.00725427  0.00734329
   0.00826186  0.00741229  0.00834616  0.0065018   0.00682279  0.00762249
   0.00801548  0.0067067   0.00659529  0.00791776  0.00638307  0.00732821
   0.00832707  0.00694989 -0.44722298  0.00826106  0.00698089  0.00673863
   0.00684366  0.00812661  0.00735869  0.00793199  0.00753155  0.00759972
   0.00703446  0.00742086  0.00740862  0.00806628  0.00834626  0.00701479
   0.00816319  0.00652222  0.00698253  0.00638308  0.00807337  0.0078628
   0.00778042  0.00833727  0.00635524  0.00635146  0.00673468  0.00765338
   0.00735579  0.00795133  0.00703036  0.00823341  0.00763893  0.00681892
   0.00827885  0.00727758]]
grade shape (62,) [ 0.01479762  0.01657929  0.01570107  0.01389749  0.01602835  0.01412569
  0.0146457   0.01394133  0.01528599  0.01665425  0.01594763  0.01614334
  0.01816272  0.01629504  0.01834804  0.01429342  0.01499909  0.01675712
  0.01762108  0.01474389  0.01449895  0.01740625  0.01403242  0.0161102
  0.01830607  0.0152785  -0.9831662   0.01816095  0.01534665  0.01481407
  0.01504496  0.01786537  0.0161772   0.01743752  0.0165572   0.01670707
  0.01546441  0.01631388  0.01628696  0.01773275  0.01834826  0.01542117
  0.0179458   0.01433833  0.01535026  0.01403243  0.01774833  0.01728543
  0.01710432  0.0183285   0.01397123  0.01396291  0.01480538  0.01682505
  0.01617082  0.01748005  0.01545541  0.01810018  0.01679328  0.01499057
  0.01820006  0.01599888]

PyTorch

Model

class FemnistNet(nn.Module):
    def __init__(self):
        super(FemnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, kernel_size=1, stride=1, padding=0)
        self.pool1 = nn.MaxPool2d(2, stride=2)

        self.fc1 = nn.Linear(1, 1)
        self.fc2 = nn.Linear(1, 62)

        # c holds the TF parameters as numpy arrays, in graph order;
        # np.transpose reverses all axes, which happens to work here
        # because the conv kernel is 1x1 (see the note below)
        self.conv1.weight.data.copy_(th.from_numpy(np.transpose(c[0])))
        self.conv1.bias.data.copy_(th.from_numpy(np.transpose(c[1])))
        self.fc1.weight.data.copy_(th.from_numpy(np.transpose(c[2])))
        self.fc1.bias.data.copy_(th.from_numpy(np.transpose(c[3])))
        self.fc2.weight.data.copy_(th.from_numpy(np.transpose(c[4])))
        self.fc2.bias.data.copy_(th.from_numpy(np.transpose(c[5])))

    def forward(self, x):
        x = x.view(-1, 1, 2, 2)
        x = self.conv1(x)
        x = th.nn.functional.relu(x)
        x = self.pool1(x)
        x = x.flatten(start_dim=1)

        x = self.fc1(x)
        l1_activations = th.nn.functional.relu(x)

        x = self.fc2(l1_activations)

        return x, l1_activations
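A note on the np.transpose calls above: without an axes argument, np.transpose reverses all dimensions. That is exactly right for the dense kernels (TF's (in, out) becomes PyTorch's (out, in)), and it is harmless for this 1x1x1x1 conv kernel, but for a general TF conv kernel of shape (H, W, in, out) a full reversal yields (out, in, W, H), i.e. a spatially transposed kernel. A sketch of the explicit, layout-preserving conversion:

import numpy as np
import torch

tf_kernel = np.random.randn(5, 5, 1, 32).astype(np.float32)  # TF: (H, W, in, out)

# Full reversal -> (out, in, W, H): the spatial dims end up swapped
reversed_all = np.transpose(tf_kernel)

# Explicit permutation -> (out, in, H, W): what nn.Conv2d expects
correct = np.transpose(tf_kernel, (3, 2, 0, 1))

print(reversed_all.shape, correct.shape)      # both (32, 1, 5, 5)
print(np.array_equal(reversed_all, correct))  # False for non-symmetric kernels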

Logits:

tensor([[-0.07566161,  0.03802636, -0.01639884, -0.13841970,  0.00423112,
         -0.12213263, -0.08598097, -0.13527007, -0.04319123,  0.04253761,
         -0.00081721,  0.01138000,  0.12924342,  0.02073286,  0.13939489,
         -0.11032829, -0.06213853,  0.04869567,  0.09896822, -0.07929925,
         -0.09605148,  0.08670155, -0.12875760,  0.00932503,  0.13710512,
         -0.04368142,  0.05325860,  0.12914579, -0.03923068, -0.07455046,
         -0.05908488,  0.11273665,  0.01347538,  0.08849659,  0.03669336,
          0.04570408, -0.03158653,  0.02188825,  0.02023709,  0.10528571,
          0.13940720, -0.03438641,  0.11722830, -0.10719128, -0.03899509,
         -0.12875672,  0.10616360,  0.07973649,  0.06920333,  0.13832925,
         -0.13312730, -0.13372292, -0.07513674,  0.05274088,  0.01308073,
          0.09093275, -0.03216879,  0.12579407,  0.05085118, -0.06270606,
          0.13129717,  0.00239096]], grad_fn=<AddmmBackward>)

Gradients

 torch.Size([1, 1, 1, 1]) gradient tensor([[[[-1.10359001]]]])
 torch.Size([1]) gradient tensor([-1.10359001])
torch.Size([1, 1]) gradient tensor([[-1.99201870]])
torch.Size([1]) gradient tensor([-2.19837618])
torch.Size([62, 1]) gradient tensor([[ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [-8.54095840],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000],
        [ 0.00000000]])
torch.Size([62]) gradient tensor([  0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000, -18.77627563,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000,   0.00000000,   0.00000000,   0.00000000,
          0.00000000,   0.00000000])

The above results show the difference between PyTorch and TF. Can you please suggest the next direction for debugging? The logits look the same, but the gradients are different.

How did you calculate the gradients, i.e. did you compute a loss or call backward with a predefined initial gradient?

Thanks @ptrblck for the reply. I computed the loss first and then called loss.backward() as follows:

PyTorch
Loss function:

def cross_entropy_with_logits(logits, targets, batch_size):
    # Note: despite the name, this expects `logits` to already be
    # probabilities (the full model's forward ends in softmax) and
    # `targets` to be one-hot, mirroring the manual TF loss above
    # (though unlike the TF version, no 1e-7 is added inside the log)
    values = targets * th.log(logits)
    reduced_values = values.sum()
    result = -reduced_values / batch_size

    return result

Computing loss

logits, activations = model.forward(X)
loss = cross_entropy_with_logits(logits, y, batch_size)

# backprop
loss.backward()

Printing gradients

for param in model.parameters():
     print(param.grad)
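As a sanity check on this formulation (a self-contained sketch; the 62 classes match the model above), it agrees with F.cross_entropy once the softmax in the forward pass is taken into account:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
raw = torch.randn(4, 62)                       # raw scores, pre-softmax
labels = torch.randint(0, 62, (4,))
one_hot = F.one_hot(labels, num_classes=62).float()

probs = raw.softmax(dim=1)                     # what my forward() returns
manual = -(one_hot * torch.log(probs)).sum() / raw.shape[0]
builtin = F.cross_entropy(raw, labels)         # expects raw scores

print(torch.allclose(manual, builtin))         # True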

Could you use the output directly to test the specific block via output.mean().backward() in both frameworks?
I would still recommend scaling down the issue and starting with small unit tests, as you are currently testing multiple modules at once.
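A minimal version of that test might look like this on the PyTorch side (a sketch; model and the shared input x are placeholders from your setup):

# PyTorch side: backprop a fixed scalar reduction of the block's output
out, _ = model(x)        # your single-layer model returns (logits, activations)
out.mean().backward()
for name, param in model.named_parameters():
    print(name, param.grad)

# TF1 side, the rough equivalent (sketch):
#   grads = tf.gradients(tf.reduce_mean(logits), tf.trainable_variables())
#   grad_vals = sess.run(grads, feed_dict={features: x_np})
# then compare the two sets of gradients with np.allclose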

Can you please suggest the smallest model I should be using? Currently I am using one Conv2d layer and two dense layers.

Based on the code it seems you are currently using 1 conv, 2 linear, 1 pool, and 2 relu modules as well as your custom criterion, so I would start with a single conv layer and make sure the outputs as well as the gradients match. Once this is done, add layers one by one and make sure the test still passes, until you are able to verify the entire block. A sketch of that incremental process is below.
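The module list here mirrors your model; the TF-side comparison at each step is only indicated in comments:

import torch
import torch.nn as nn

# Grow the model one module at a time and verify each prefix against the
# corresponding TF sub-graph before adding the next layer
modules = [
    nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),
    nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2, stride=2),
]
x = torch.randn(1, 1, 28, 28)

for n in range(1, len(modules) + 1):
    prefix = nn.Sequential(*modules[:n])
    out = prefix(x)
    out.mean().backward()
    # here: load the matching TF weights into `prefix`, run the same input
    # through the TF graph up to this layer, and np.allclose both the
    # outputs and the gradients
    print(n, out.shape)
    prefix.zero_grad()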