In the following sample class from Udacity’s PyTorch course, an extra dimension must be added to the incoming kernel weights, and the course never explains why. I’ve highlighted this fact in the docstring of the code below:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """
    Network containing a 4 filter convolutional layer and 2x2 maxpool layer.
    """
    def __init__(self, weight):
        """
        weight: the kernel values as a tensor (n_kernels, 1, k_height, k_width).
        """
        super(Net, self).__init__()
        # Get height and width of kernel
        k_height, k_width = weight.shape[2:]
        # define a 4 feature convolutional layer
        self.conv = nn.Conv2d(1, 4, kernel_size=(k_height, k_width), bias=False)
        self.conv.weight = torch.nn.Parameter(weight)
        # define a (2x2) pooling layer
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        conv_x = self.conv(x)
        relu_x = F.relu(conv_x)
        pool_x = self.pool(relu_x)
        return conv_x, relu_x, pool_x
```
This is also illustrated in the class notebook with the following code:
```python
import numpy as np
import torch

filter_vals = np.array([[-1, -1, 1, 1]] * 4)
filter_1 = filter_vals
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])
weights = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
```
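For concreteness, printing the shapes at each step (my own sanity check, not part of the notebook; the filter values here are placeholders) shows exactly what the `unsqueeze(1)` does:

```python
import numpy as np
import torch

# four placeholder 4x4 filters stacked into a (4, 4, 4) array
filters = np.array([[[-1, -1, 1, 1]] * 4] * 4)
print(filters.shape)  # (4, 4, 4)

# unsqueeze(1) inserts a size-1 dimension at index 1
weights = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
print(weights.shape)  # torch.Size([4, 1, 4, 4])
```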
I’m just wondering why the class wasn’t simply designed to take a kernel tensor that has shape (4,4,4). Why did they change it to (4,1,4,4)?
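One thing I did notice while poking at this (my own check, not from the course): the weight tensor that `nn.Conv2d` allocates for itself is already four-dimensional, laid out as (out_channels, in_channels, k_height, k_width), which appears to be the layout the `unsqueeze(1)` is matching:

```python
import torch.nn as nn

# same layer configuration as in the Net class above
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=(4, 4), bias=False)
print(conv.weight.shape)  # torch.Size([4, 1, 4, 4])
```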