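import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 3)    # 3 input channels -> 6 output channels, 3x3 kernel
        self.conv2 = nn.Conv2d(6, 16, 3)   # 6 -> 16 channels, 3x3 kernel
        self.fc1 = nn.Linear(3*224*224, 25088)
        self.fc2 = nn.Linear(25088, 12008)
        self.fc3 = nn.Linear(12008, 1734)
        self.fc4 = nn.Linear(1734, 67)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)          # flatten (N, C, h, w) to (N, C*h*w)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

net = Net()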
This runs, but I do not understand what is happening in: self.fc1 = nn.Linear(3*224*224, 25088)
After the convolutions, with their different kernels, it seems to me that the number of features should be different. How can the second dimension of the tensor that goes into fc1 have 224x224x3 features?
I only ran this code and didn't try the model with an input image.
The input dimension of your first linear layer depends on the output of your last convolution layer. As an input passes through the convolutions, its spatial size shrinks while, in your network, the number of channels grows. The result of a convolution is still a feature map of shape (N, C, h’, w’), though. It has to be flattened to shape (N, C×h’×w’) before a linear layer can take it. In forward this is a single reshape, for example x = x.view(x.size(0), -1).
Once the feature map is flattened, we pass it through the linear layers fc1, fc2, and so on.
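A minimal sketch of the flattening step, assuming a small dummy feature map (the name feature_map and the 5x5 spatial size are made up for illustration). It shows that view and torch.flatten both keep the batch dimension and merge the rest:

```python
import torch

# Hypothetical feature map: batch N=2, C=16 channels, h'=5, w'=5.
feature_map = torch.zeros(2, 16, 5, 5)

# Keep the batch dimension, merge channels and spatial dims into one.
flat_view = feature_map.view(feature_map.size(0), -1)
flat_fn = torch.flatten(feature_map, start_dim=1)

print(flat_view.shape)  # torch.Size([2, 400])
print(flat_fn.shape)    # torch.Size([2, 400])
```

The second dimension, 16*5*5 = 400, is what the in_features of the first linear layer would have to be for this feature map.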
According to your network definition, the first convolution layer self.conv1 expects an input with 3 channels and outputs a 6-channel feature map, as its definition shows:
self.conv1 = nn.Conv2d(3, 6, 3)
In the same way, conv2 takes those 6 channels and outputs 16, so the number of channels increases with each convolution. In my answer, C is the number of output channels of the last convolution layer, which is 16 in your case.
I think I’ve confused you.
If the number of channels in the last convolution layer is 16,
wouldn’t we need to write this line: self.fc1 = nn.Linear(16*224*224, 25088)
instead of: self.fc1 = nn.Linear(3*224*224, 25088)?
So I added a few print statements to the forward call to get the intermediate output shapes, and it turns out the input size of the linear layer should be something else entirely. I used an input of shape (2, 3, 224, 224), where 2 is the batch size, 3 is the number of channels, and 224 is the height and width of the input.
The output shapes we get are:
Shape after first convolution: torch.Size([2, 6, 222, 222])
Shape after second convolution: torch.Size([2, 16, 220, 220])
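A check like this can be reproduced roughly as follows. This is a sketch, not the exact code used for the output above: it rebuilds only the two convolution layers from the question and feeds a dummy batch through them. The linear layers are left out on purpose, since a layer with 16*220*220 inputs and 25088 outputs would hold about 19 billion weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same convolution layers as in the question.
conv1 = nn.Conv2d(3, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

x = torch.zeros(2, 3, 224, 224)  # dummy batch: N=2, C=3, H=W=224

with torch.no_grad():
    out1 = F.relu(conv1(x))
    out2 = F.relu(conv2(out1))

print("Shape after first convolution:", out1.shape)
print("Shape after second convolution:", out2.shape)

# Number of features per sample once the map is flattened.
flat_features = out2[0].numel()
print("in_features for fc1:", flat_features)  # 774400
```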
So the definition of the first linear layer becomes:
self.fc1 = nn.Linear(16*220*220, 25088)
The thing to notice is that every convolution here trims pixels from both height and width: a 3x3 kernel with no padding and stride 1 removes 2 pixels per dimension, so the size goes from 224 to 222 and then to 220.
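The shrinkage can also be computed without running the network. The helper below is a sketch (the name conv_out_size is made up for this example) of the output-size formula given in the nn.Conv2d documentation, with the defaults stride=1, padding=0, dilation=1 that the layers in the question use:

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output height or width of a Conv2d layer for one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

h = 224
h = conv_out_size(h, kernel_size=3)  # after conv1
print(h)  # 222
h = conv_out_size(h, kernel_size=3)  # after conv2
print(h)  # 220

channels = 16  # out_channels of conv2
print(channels * h * h)  # 774400, the in_features that fc1 needs
```

With padding=1, the same formula gives 224 for a 3x3 kernel, which means the spatial size would stay unchanged.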