Does nn.Linear perform a convolution operation?

I’m adding a softmax layer at the end of my model for classification, but I’m not sure whether the nn.Linear layer performs a convolution. If it doesn’t, does the convolution need to be done separately?

Suppose I have 5 classes to classify at the end and the second-to-last layer outputs 64*8 features. Should the last layer be something like this:

                return nn.Sequential(
                    nn.Conv2d(64*8, output_channels, kernel_size, stride, padding, bias=False),

or it should be like this:

                return nn.Sequential(

Hi Prithviraj!

The short answer is that a Linear can perform a Conv2d as a special
case (assuming appropriate reshaping of the various tensors).

First make sure you understand how Conv2d and Linear treat the
dimensions of the tensors passed into them – which dimensions are
“channels,” which are 2d dimensions over which to convolve, and
so on.

Assume that we fully flatten (except for the batch dimension)
the tensor passed into Linear so that it has shape
[nBatch, nChannels * height * width] (and Linear is
Linear (nChannels * height * width, output_channels)).
Then Linear doesn’t perform a convolution, per se, but it is
more general than a convolution in that you could construct a
Linear (that has lots of zeroes for weights) that performs a
convolution (assuming that you appropriately reshape the
output tensor).
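
To make the "special case" concrete, here is a minimal sketch (not part of the original post) that builds a Linear reproducing a given Conv2d. The sizes are hypothetical, and the construction assumes stride 1 and no padding. Each one-hot input is pushed through the bias-free convolution, and the flattened response becomes one column of the Linear's weight matrix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small sizes, chosen only for illustration.
n_batch, c_in, c_out, h, w, k = 2, 2, 3, 5, 5, 3
conv = torch.nn.Conv2d(c_in, c_out, k)
h_out, w_out = h - k + 1, w - k + 1      # stride 1, no padding

n_in = c_in * h * w
n_out = c_out * h_out * w_out
lin = torch.nn.Linear(n_in, n_out)

with torch.no_grad():
    # One-hot inputs: row i of `response` is the conv output for input element i,
    # which is column i of the equivalent weight matrix.
    basis = torch.eye(n_in).view(n_in, c_in, h, w)
    response = F.conv2d(basis, conv.weight).flatten(start_dim=1)   # [n_in, n_out]
    lin.weight.copy_(response.t())
    # Each output channel's bias is repeated at every output location.
    lin.bias.copy_(conv.bias.repeat_interleave(h_out * w_out))

x = torch.randn(n_batch, c_in, h, w)
y_conv = conv(x)
# Flatten the input for Linear, then reshape its output back to the conv layout.
y_lin = lin(x.flatten(start_dim=1)).view(n_batch, c_out, h_out, w_out)

same = torch.allclose(y_conv, y_lin, atol=1e-5)
print('outputs match:', same)
print('fraction of zero weights:', (lin.weight == 0).float().mean().item())
```

Most entries of the resulting weight matrix are zero, because each output location only sees the input elements under its kernel window.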

Note that your first version does not have a non-linearity in between
the Conv2d and Linear layers. They therefore combine into a single
linear operation on your input tensor. They can therefore be replaced
by a single Linear layer (assuming that you pay proper attention to
the shapes of the tensors and are willing to flatten the input tensor).
To say this a little bit differently, a Linear is the most general,
fully-connected layer, so it can perform any linear operation, including,
as a special case, a convolution, or a convolution followed by a
Linear (assuming no intervening non-linearity).
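
As a rough check of this claim, the following sketch (an illustration with hypothetical sizes, not code from the thread) merges a Conv2d whose kernel covers the whole input with a following Linear into one Linear layer. It assumes there is no activation between the two layers.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes for illustration.
n_batch, c_in, c_mid, n_classes, size = 2, 3, 5, 5, 4
conv = torch.nn.Conv2d(c_in, c_mid, size)      # kernel covers the whole 4x4 input
lin = torch.nn.Linear(c_mid, n_classes)
merged = torch.nn.Linear(c_in * size * size, n_classes)

with torch.no_grad():
    w_conv = conv.weight.flatten(start_dim=1)  # [c_mid, c_in * 4 * 4]
    # Composition of two affine maps: W = W_lin W_conv, b = W_lin b_conv + b_lin
    merged.weight.copy_(lin.weight @ w_conv)
    merged.bias.copy_(lin.weight @ conv.bias + lin.bias)

x = torch.randn(n_batch, c_in, size, size)
y_two = lin(conv(x).flatten(start_dim=1))      # Conv2d then Linear, no activation
y_one = merged(x.flatten(start_dim=1))         # single Linear on the flattened input

matches = torch.allclose(y_two, y_one, atol=1e-5)
print('two layers == one Linear:', matches)
```

Inserting a non-linearity (for example a Sigmoid) between `conv` and `lin` breaks this equivalence, which is the point at which the extra layer starts to add expressive power.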

(As an aside, assuming that you’re performing “global” classification,
that is assigning the whole image to a single class, rather than
per-pixel classification, such as semantic segmentation, by the time
you get to the “last” layer you have probably reduced things down
to global features rather than features that still maintain some spatial
location. So in your “last” layer there would no longer be any spatial
dimensions over which to convolve.)


K. Frank

Hello Frank,

Firstly, thank you for the detailed explanation.

Yes, Frank, I’m performing global classification.

Will there be a need to do that?

My image input shape at the start is [batch_size,channels,image_size,image_size]

and by the time it reaches the last layer, I believe the shape is [batch_size,64*8,4,4].

I want to convert this to shape [batch_size,output_channels,1,1] after the convolution (then the linear and softmax stuff can come).

So now you know the shapes. What do you recommend I do?

Hi Prithviraj!

In that case, you will probably want to use CrossEntropyLoss as
your loss criterion. Therefore you will not want to use an explicit
Softmax as CrossEntropyLoss has, in effect, Softmax built in.
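
A small sketch of what "built in" means here (an added illustration, with made-up logits and targets): CrossEntropyLoss applied to raw scores gives the same value as log_softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5)           # raw, un-normalized scores for 5 classes
target = torch.randint(5, (2,))      # integer class labels

loss_ce = torch.nn.CrossEntropyLoss()(logits, target)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

equal = torch.allclose(loss_ce, loss_manual)
print(loss_ce, loss_manual, equal)
```

Passing already-softmaxed probabilities to CrossEntropyLoss would therefore apply the softmax a second time.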

If you don’t want to have a non-linearity (the activation in my example,
below) then the convolution becomes redundant. So you might as well
just have a single Linear layer (no Conv2d) for which you will need
to flatten the (non-batch dimensions of the) tensor you pass into the
Linear. This is illustrated below.

It really depends on what you want to do. What is your proposed
Conv2d supposed to accomplish? What is the final Linear supposed
to accomplish?

In any event, you can shrink your last two dimensions, [4, 4] to
[1, 1] by convolving with a kernel of size 4.

Then to pass this to a subsequent Linear you have to get rid of the
trailing [1, 1] dimensions (for example, by squeeze()ing).
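
One caveat worth sketching (this is an added note, not from the original post): squeeze() with no arguments removes every size-1 dimension, so a batch of one would lose its batch dimension too. Flattening from dimension 1 is one way to avoid that.

```python
import torch

t = torch.randn(1, 5, 1, 1)            # a batch containing a single sample

squeezed = t.squeeze()                 # drops the batch dimension as well
flat = t.flatten(start_dim=1)          # keeps the batch dimension

print(squeezed.shape)                  # torch.Size([5])
print(flat.shape)                      # torch.Size([1, 5])
```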

These various points are illustrated by the following script:

import torch
print (torch.__version__)

_ = torch.manual_seed (2021)

n648 = 3   # smaller placeholder for 64*8
batch_size = 2
output_channels = 5
conv = torch.nn.Conv2d (n648, output_channels, 4)
activation = torch.nn.Sigmoid()   # for example
lin = torch.nn.Linear (output_channels, output_channels)

t1 = torch.randn (batch_size, n648, 4, 4)
print ('t1.shape =', t1.shape)
t2 = conv (t1)
print ('t2.shape =', t2.shape)
t3 = activation (t2)   # without non-linearity, conv becomes redundant
print ('t3.shape =', t3.shape)
t4 = t3.squeeze()
print ('t4.shape =', t4.shape)

prediction = lin (t4)
print ('prediction.shape =', prediction.shape)
target = torch.randint (output_channels, (batch_size,))
print ('target.shape =', target.shape)

# no softmax between final Linear layer and CrossEntropyLoss

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn (prediction, target)
print ('loss =', loss)

# if you want just a single Linear layer, you can flatten t1

t1_flat = torch.flatten (t1, start_dim = 1)
print ('t1_flat.shape =', t1_flat.shape)
lin2 = torch.nn.Linear (n648 * 4 * 4, output_channels)
prediction2 = lin2 (t1_flat)
print ('prediction2.shape =', prediction2.shape)
loss2 = loss_fn (prediction2, target)
print ('loss2 =', loss2)

And here is its output:

t1.shape = torch.Size([2, 3, 4, 4])
t2.shape = torch.Size([2, 5, 1, 1])
t3.shape = torch.Size([2, 5, 1, 1])
t4.shape = torch.Size([2, 5])
prediction.shape = torch.Size([2, 5])
target.shape = torch.Size([2])
loss = tensor(1.6289, grad_fn=<NllLossBackward>)
t1_flat.shape = torch.Size([2, 48])
prediction2.shape = torch.Size([2, 5])
loss2 = tensor(2.1311, grad_fn=<NllLossBackward>)


K. Frank

Hello Frank,

Firstly sorry for the late reply. I have been pretty caught up lately.

Yes, my aim was to use softmax at the last layer. That’s why I was thinking of using Linear as the last layer of the network (so that I could then apply the softmax), and then the question arose of whether it performs a convolution or not.

As you said, cross entropy loss has a built-in softmax, so I think I can end my model with just one convolution layer with no activation and send that output straight to the cross entropy loss. (Please also give your thoughts.)

Based on what I said above, I don’t think there is a need to flatten the output and send it to a linear layer.

And thanks for all the extra effort you put up.
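
For reference, a minimal sketch of the setup described in this last post, with a single Conv2d feeding CrossEntropyLoss directly. The sizes are hypothetical (3 stands in for 64*8), and the sketch assumes class-index targets. The conv output has shape [batch_size, 5, 1, 1], so either the trailing dimensions are flattened away or the target is given matching trailing dimensions.

```python
import torch

torch.manual_seed(0)
batch_size, n_features, n_classes = 2, 3, 5    # 3 is a placeholder for 64*8
conv = torch.nn.Conv2d(n_features, n_classes, 4)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(batch_size, n_features, 4, 4)
target = torch.randint(n_classes, (batch_size,))

out = conv(x)                                  # shape [2, 5, 1, 1]

# Option 1: drop the trailing [1, 1] dimensions, target stays [batch_size]
loss_flat = loss_fn(out.flatten(start_dim=1), target)

# Option 2: keep the 4-d output and give the target shape [batch_size, 1, 1]
loss_4d = loss_fn(out, target.view(batch_size, 1, 1))

print(loss_flat, loss_4d)
```

Both options compute the same loss; the first is the more common layout for whole-image classification.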