I'm just starting to work with images for learning, and I have a question. I have an 80*80 image going into the following architecture:
- Convolution that outputs 16 feature maps using an 8*8 kernel Conv2d(1,16,(8,8))
- Convolution that takes the 16 outputs as input and outputs 32 feature maps using a 4*4 kernel Conv2d(16,32,(4,4))
- I want to take the output of this last layer and stick it into a 256 unit fully connected layer but I run into a dimensional error.
Any tips?
Thanks a lot!
Do you know what the size of the output of the last layer (the one before the fully connected layer) is?
I don’t. How can I find out?
You can print it with print(output.size()) (output is your tensor right before you feed it to the fc layer).
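For reference, here is a minimal sketch of that check. It assumes the two layers from the question with PyTorch's default stride of 1 and no padding, and a dummy single-channel 80*80 input. Activations are left out because they would not change the shape.

```python
import torch
import torch.nn as nn

# The two convolutions described in the question (default stride 1, no padding).
conv1 = nn.Conv2d(1, 16, (8, 8))
conv2 = nn.Conv2d(16, 32, (4, 4))

# Dummy batch holding one single-channel 80x80 image (an assumption for illustration).
x = torch.randn(1, 1, 80, 80)

# 80 -> 73 after the 8x8 kernel, 73 -> 70 after the 4x4 kernel.
output = conv2(conv1(x))
print(output.size())
```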
It says [1,32,70,70]. Do I have to multiply all of that? Then it will be a 156800*256 matrix? That sounds like a lot, doesn’t it?
That sounds right (see the nn.Linear docs).
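A rough sketch of how the flattening and the fully connected layer could fit together is below. The random tensor is only a stand-in for the real conv output, and the variable names are made up for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for the conv output: [batch, channels, height, width].
output = torch.randn(1, 32, 70, 70)

# Flatten everything except the batch dimension: 32 * 70 * 70 = 156800 features.
flat = output.view(output.size(0), -1)

# Fully connected layer from the flattened features to 256 units.
fc = nn.Linear(32 * 70 * 70, 256)
hidden = fc(flat)

print(flat.size())        # shape fed into the fc layer
print(hidden.size())      # shape coming out of the fc layer
print(fc.weight.numel())  # number of weights, bias not included
```

Keeping the batch dimension in `view` means the same code works for batches larger than one.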
If that’s too big, you can downsample using a pooling layer, something like max pooling.
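As one possible illustration, the sketch below puts a 2*2 max pooling layer after each convolution. Where to pool and which window size to use are assumptions here, not something the question specifies.

```python
import torch
import torch.nn as nn

# Same convolutions as in the question, with an assumed 2x2 max pool after each.
features = nn.Sequential(
    nn.Conv2d(1, 16, (8, 8)),   # 80x80 -> 73x73
    nn.MaxPool2d(2),            # 73x73 -> 36x36 (floor division)
    nn.Conv2d(16, 32, (4, 4)),  # 36x36 -> 33x33
    nn.MaxPool2d(2),            # 33x33 -> 16x16
)

x = torch.randn(1, 1, 80, 80)
output = features(x)
print(output.size())

# The fully connected layer now sees far fewer input features.
flat = output.view(output.size(0), -1)
fc = nn.Linear(flat.size(1), 256)
print(fc.weight.numel())
```

With this layout the flattened size drops from 156800 to 32 * 16 * 16 = 8192 features.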
But is it actually common to have matrices that big? It sounds impressive.
Common architectures usually include pooling operations.
However, numbers this large are also common.
Have a look at the classic VGG16 architecture.
The last pooling layer returns an activation of [batch, 512, 7, 7], which is fed into a Linear layer with 4096 units. This connection has 512*7*7*4096 = 102,760,448 parameters (skipping the bias here).
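The count can be checked with plain arithmetic. The sketch below uses only the shapes quoted above, and the variable names are illustrative.

```python
# Weights of a fully connected layer: in_features * out_features.
in_features = 512 * 7 * 7   # flattened [512, 7, 7] activation
out_features = 4096

weights = in_features * out_features
biases = out_features       # one bias per output unit

print(f"in_features: {in_features:,}")
print(f"weights:     {weights:,}")
print(f"with bias:   {weights + biases:,}")
```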