Implement Fully Connected using 1x1 Conv

Hi,

In theory, fully connected layers can be implemented using 1x1 convolution layers. Below are two networks with identical weights: one implemented with fully connected layers, and the other implementing the same fully connected network with 1x1 convolutions.

However, the results are different, and I am not able to explain why. What have I done wrong in the following code?

import torch
import torch.nn

class Model(torch.nn.Module):
    def __init__(self, input_dim: int):

        super(Model, self).__init__()

        self._backbone = torch.nn.Sequential(
            torch.nn.Linear(input_dim, 64, bias=True),
            torch.nn.ELU(),
            torch.nn.Linear(64, 128, bias=True),
            torch.nn.ELU(),
            torch.nn.Linear(128, 256, bias=True),
            torch.nn.ELU(),
        )
        self._logits = torch.nn.Linear(256, 4, bias=True)

    def forward(self, x: torch.Tensor):

        feats = self._backbone(x)
        logits = self._logits(feats)

        return logits, feats


class ModelConv(torch.nn.Module):
    def __init__(self, input_dim: int):

        super(ModelConv, self).__init__()

        self._backbone = torch.nn.Sequential(
            torch.nn.Conv2d(input_dim, 64, 1, bias=True),
            torch.nn.ELU(),
            torch.nn.Conv2d(64, 128, 1, bias=True),
            torch.nn.ELU(),
            torch.nn.Conv2d(128, 256, 1, bias=True),
            torch.nn.ELU(),
        )
        self._logits = torch.nn.Conv2d(256, 4, 1, bias=True)

    def forward(self, x: torch.Tensor):

        # (batch, samples, input_dim) -> (batch, input_dim, samples, 1):
        # the feature dimension becomes the channel dimension expected by Conv2d
        x = x.unsqueeze(3).permute(0, 2, 1, 3)

        feats = self._backbone(x)
        logits = self._logits(feats)

        # (batch, channels, samples, 1) -> (batch, samples, channels)
        feats = feats.squeeze(3).permute(0, 2, 1)
        logits = logits.squeeze(3).permute(0, 2, 1)

        return logits, feats


if __name__ == '__main__':
    input_dim = 256
    num_classes = 4
    samples = 5
    batch_size = 16
    torch.manual_seed(2010)
    x = torch.randn(batch_size, samples, input_dim)

    def init(m):
        # give every weight and bias the same constant value so the two models share identical parameters
        if hasattr(m, 'weight'):
            torch.nn.init.constant_(m.weight, 1)
        if hasattr(m, 'bias'):
            torch.nn.init.constant_(m.bias, 1)

    device = torch.device('cuda:0')
    x = x.to(device)

    model = Model(input_dim)
    model.apply(init)
    model.cuda(device)
    for p in model.parameters():
        assert p.max() == p.min()
    logits, feats = model(x)


    model1 = ModelConv(input_dim)
    model1.apply(init)
    model1.cuda(device)
    for p in model1.parameters():
        assert p.max() == p.min()

    logits1, feats1 = model1(x)

    print(torch.mean(logits1-logits))
    print((feats1-feats).sum())

Are the height and width of the input greater than 1? If so, a 1x1 convolutional layer would not act as a single fully connected layer over the whole input, but rather as a fully connected layer applied to each spatial location individually.

If the height and width are greater than 1, then you can make a convolutional layer with a kernel of the same height and width as the input image to act as a flatten + fully connected layer.
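For instance, here is a minimal sketch of that equivalence (the sizes and names below are made up for illustration, not taken from your code): a Conv2d whose kernel covers the whole input behaves like Flatten followed by Linear once its kernel is a reshaped copy of the linear weights.

import torch

# hypothetical sizes, just for illustration
N, C, H, W, K = 2, 3, 8, 8, 10
x = torch.randn(N, C, H, W)

fc = torch.nn.Linear(C * H * W, K)
conv = torch.nn.Conv2d(C, K, kernel_size=(H, W))

# copy the fully connected weights into the conv kernel:
# Linear weight is (K, C*H*W); Conv2d weight is (K, C, H, W)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(K, C, H, W))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))        # (N, K)
out_conv = conv(x).flatten(1)    # (N, K, 1, 1) -> (N, K)
print(torch.allclose(out_fc, out_conv, atol=1e-5))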

Thank you @patrickwilliams3

A 1x1 convolution layer computes the inner product along the channel dimension (dimension 1 of an NCHW tensor), so a fully connected layer can be implemented using a 1x1 convolution. Take a segmentation network as an example: the last layer in a segmentation network is usually implemented using a 1x1 convolution.

What it does, in fact, is compute the inner product of the last layer's weights with the embedding of each pixel to obtain logits (class predictions) for each pixel.
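As a rough sketch of what I mean (the feature-map sizes below are made up, not from the code above), the logits from a 1x1 convolution match applying the same weights as a per-pixel linear map:

import torch
import torch.nn.functional as F

# hypothetical feature map: batch of 2, 256-dim embedding per pixel, 4 classes
feat = torch.randn(2, 256, 16, 16)
conv = torch.nn.Conv2d(256, 4, kernel_size=1)

out_conv = conv(feat)                                        # (2, 4, 16, 16)

# the same per-pixel inner product, done explicitly with the 1x1 kernel weights
w = conv.weight.view(4, 256)                                 # (out, in)
out_lin = F.linear(feat.permute(0, 2, 3, 1), w, conv.bias)   # (2, 16, 16, 4)
out_lin = out_lin.permute(0, 3, 1, 2)                        # back to (2, 4, 16, 16)

print(torch.allclose(out_conv, out_lin, atol=1e-5))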

There is no question that we can implement a fully connected layer using 1x1 convolutions. What I don’t understand is why the results in the code above are different.

This might be a dumb hypothesis: perhaps the order of operations in the convolution implementation is different? Perhaps something is happening during permutation? More likely, there is a bug in the above code and I just don’t see it. Even more likely, I am entirely wrong :smiley:

P.S.: You can compare the results of the 1x1 convolution with a tensordot operation as well.
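For example, roughly along these lines (a quick sketch reusing the (batch, samples, input_dim) layout from the code above; the sizes are made up):

import torch

conv = torch.nn.Conv2d(256, 64, kernel_size=1)
x = torch.randn(16, 5, 256)                              # (batch, samples, input_dim)

out_conv = conv(x.unsqueeze(3).permute(0, 2, 1, 3))      # (batch, 64, samples, 1)
out_conv = out_conv.squeeze(3).permute(0, 2, 1)          # (batch, samples, 64)

# tensordot over the feature dimension gives the same result
w = conv.weight.view(64, 256)                            # (out, in)
out_td = torch.tensordot(x, w.t(), dims=1) + conv.bias   # (batch, samples, 64)

print(torch.allclose(out_conv, out_td, atol=1e-5))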

I looked through your code a bit and it seems that you are making a tensor of shape (batch_size, channels, samples, 1) for the 2D convolution.

Since your sample size is greater than one, the convolution differs from a fully connected layer, because at each input channel the kernel weight is the same for all five samples. This is a constraint that a fully connected layer would not have, allowing the fully connected layer to learn more complex functions. So here the full size of your first convolutional kernel would be (input_channels, output_channels, 1, 1), while the full size of the weight on your fully connected layer would be (input_channels, output_channels, samples, 1).

This sort of explains why a (samples x 1) convolutional layer would be equivalent to the fully connected layer here, as the (samples x 1) kernel would have size (input_channels, output_channels, samples, 1), like the fully connected layer.

I am also fairly new to PyTorch so do not take what I say as gospel, but there is a fundamental difference between the convolutional layer and fully connected layer here.

There are a few things missing in your explanation. The fully connected layer here means a tensordot operation on the last dimension of the input.
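For what it's worth, the parameter shapes PyTorch reports make this concrete; neither layer's weight depends on the number of samples (the layer sizes below are just for illustration):

import torch

lin = torch.nn.Linear(256, 64)
conv = torch.nn.Conv2d(256, 64, kernel_size=1)

print(lin.weight.shape)    # torch.Size([64, 256])
print(conv.weight.shape)   # torch.Size([64, 256, 1, 1])

# nn.Linear applied to a (batch, samples, features) tensor contracts only the last
# dimension, so the same weight matrix is reused for every sample
x = torch.randn(16, 5, 256)
print(lin(x).shape)        # torch.Size([16, 5, 64])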

Just set samples to 1 and run the code again. It won’t give you identical results.

I changed the sample size to one, and they only differ very slightly when logits or feats is a very large number. Probably some different rounding behavior.
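That matches how float32 works: near large magnitudes the spacing between adjacent representable values gets large, so two mathematically equivalent computations can land a few units apart. A quick way to see the spacing (just an illustration):

import torch

for magnitude in [1e0, 1e4, 1e8]:
    v = torch.tensor(magnitude, dtype=torch.float32)
    gap = torch.nextafter(v, torch.tensor(float('inf'))) - v
    print(magnitude, gap.item())
# the gap grows roughly in proportion to the magnitude (~1.2e-7, ~1e-3, ~8.0)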


I actually figured out what is happening.
I set the weights to 1 and pass the layers through ELU. This generates very large values for the embeddings, resulting in very large values for the logits.

The difference between the Linear and Conv implementations becomes negligible if you consider the magnitude of the logits.
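To make that concrete, one rough way to check it (assuming logits and logits1 have been computed as in the script above; the exact tolerance that passes may vary with hardware and PyTorch version):

# relative rather than absolute difference
rel = (logits1 - logits).abs() / logits.abs().clamp_min(1e-12)
print(rel.max())
# a fairly loose relative tolerance should hold even though the absolute differences look large
print(torch.allclose(logits1, logits, rtol=1e-3))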

Simply change ELU to Sigmoid (a squashing activation) and the initial weights to 0.001, and the difference between the Linear and Conv implementations will be small.