How does a transposed convolutional layer apply its kernels to the inputs?

Hi, I know how a convolutional layer applies its kernels to an input, but I don’t understand how a transposed convolutional layer applies its kernels to an input. My main question is below.

Question:
If a convolutional layer takes in a three-channel input and produces a two-channel output, with each kernel being 2x2, I know it will have 6 kernels divided into two groups of three, because each group produces one of the two output channels, as shown below.

from torch import nn
conv1 = nn.Conv2d(3, 2, 2, 1, 0)   # in_channels=3, out_channels=2, kernel_size=2, stride=1, padding=0
conv1.weight.data.numpy()

array([[[[-0.09294177,  0.19253497],
     [-0.27820718, -0.07189114]],

    [[-0.27682984, -0.05606458],           <---- one group
     [ 0.01134909, -0.21749675]],

    [[-0.2351923 ,  0.2374857 ],
     [-0.0346033 , -0.26447   ]]],


   [[[ 0.2467276 ,  0.06628369],
     [ 0.26501465,  0.11644475]],
                                           <----- one group making two 
                                                  groups of three
    [[-0.09835644, -0.06396657],
     [-0.05590855,  0.06890304]],

    [[ 0.22788118,  0.22287966],
     [-0.20899878, -0.03188486]]]], dtype=float32)
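For what it’s worth, checking the weight shape confirms this grouping (as I understand it, Conv2d stores its weight as (out_channels, in_channels, kernel_height, kernel_width)):

print(conv1.weight.shape)   # torch.Size([2, 3, 2, 2]) -> 2 groups (one per output channel) of 3 kernels (one per input channel)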

You can see that it has six kernels divided into two groups of three, which makes sense. But if a transposed convolutional layer takes in a three-channel input and produces a two-channel output, again with 2x2 kernels, why does it have 6 kernels divided into three groups of two instead of two groups of three, as shown below?

conv2 = nn.ConvTranspose2d(3, 2, 2, 1, 0)   # in_channels=3, out_channels=2, kernel_size=2, stride=1, padding=0
conv2.weight.data.numpy()

array([[[[ 0.28577724, -0.29587495],
     [-0.24003945, -0.3524448 ]],
                                    <---- one group
    [[-0.15984103,  0.22188954],
     [-0.10990701, -0.20565327]]],


   [[[ 0.17101079, -0.17623127],
     [-0.12097928, -0.0211492 ]],
                                     <----- one group
    [[-0.21161021, -0.33530322],
     [-0.16497111,  0.19984488]]],


   [[[-0.05084743, -0.2563213 ],
     [-0.28287342, -0.30839682]],
                                      <----- one group making three groups
                                             of two
    [[-0.330719  ,  0.07809895],
     [-0.16823643, -0.34404978]]]], dtype=float32)

You can see above that it has 6 kernels divided into 3 groups of 2. Isn’t it meant to be divided into 2 groups of three, since it produces a two-channel output, just like the convolutional layer? Why is the arrangement this way, and how do 6 kernels divided into 3 groups of 2 turn a 3-channel input into a 2-channel output?
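To make the question concrete, here is my current guess at what the layer might be doing, written out as a brute-force loop (the layer, input, and seed below are just for illustration, and the loop itself is only my assumption about the mechanism, so please correct it if it is wrong):

import torch
from torch import nn

torch.manual_seed(0)

# Same setup as above: 3 input channels -> 2 output channels, 2x2 kernel,
# stride 1, no padding; bias disabled so only the kernels matter.
convT = nn.ConvTranspose2d(3, 2, 2, 1, 0, bias=False)
x = torch.randn(1, 3, 4, 4)                 # (batch, in_channels, H, W)
reference = convT(x)                        # shape (1, 2, 5, 5)

# My guess: the weight is laid out as (in_channels, out_channels, kH, kW),
# so each of the 3 input channels owns one group of 2 kernels (one per
# output channel). Every input pixel "stamps" its scaled kernels onto the
# output, and the stamps from all 3 input channels are summed per output
# channel, which is how 3 groups of 2 kernels still yield 2 output channels.
w = convT.weight.detach()                   # shape (3, 2, 2, 2)
out = torch.zeros(1, 2, 5, 5)
for i in range(3):                          # input channels
    for o in range(2):                      # output channels
        for r in range(4):                  # input rows
            for c in range(4):              # input columns
                out[0, o, r:r+2, c:c+2] += x[0, i, r, c] * w[i, o]

print(torch.allclose(out, reference.detach(), atol=1e-6))   # expect True if my guess is right

Is that scatter-and-sum picture the right way to think about the 3 groups of 2 kernels?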