Suppose I have a CNN with conv layers and fully connected layers. Since image tensors in PyTorch are C * H * W, do we need to permute the dimensions of the feature maps after the final conv layer?
I’m thinking this might not be necessary since we have fully connected layers and the neurons in those layers will learn to pick the correct tensor values from the convolution feature maps during training.
But I could be wrong. Please advise.
What does "permute the dimensions" mean?
As far as I understand, in TensorFlow there should be a flatten operation between the conv layers and the fc layers, and I think it is similar in PyTorch.
Yup, there should be a flatten operation. But do we need to first change the dimensions of (batch,c,h,w) to (batch,h,w,c) before flattening?
Of course not. The default data format in PyTorch is NCHW.
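To make the flattening point concrete, here is a small sketch. The tensor sizes and layer widths are made up for illustration. It shows that flattening (batch, c, h, w) directly and flattening after a permute to (batch, h, w, c) give the same values in a different fixed order, so a fully connected layer with correspondingly reordered weight columns produces the same output. In other words, the ordering before a dense layer is something the weights can absorb during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 3, 3)  # (batch, C, H, W), illustrative sizes

flat_chw = x.flatten(start_dim=1)                      # C-major order, shape (2, 36)
flat_hwc = x.permute(0, 2, 3, 1).flatten(start_dim=1)  # H-major order, shape (2, 36)

# perm[j] = position in flat_chw of the value stored at position j of flat_hwc
idx = torch.arange(4 * 3 * 3).reshape(1, 4, 3, 3)
perm = idx.permute(0, 2, 3, 1).flatten()

fc = nn.Linear(36, 8)      # dense layer fed with the CHW-ordered features
fc_hwc = nn.Linear(36, 8)  # dense layer fed with the HWC-ordered features
with torch.no_grad():
    # Reorder the weight columns to match the reordered inputs
    fc_hwc.weight.copy_(fc.weight[:, perm])
    fc_hwc.bias.copy_(fc.bias)

same = torch.allclose(fc(flat_chw), fc_hwc(flat_hwc), atol=1e-5)
print("outputs match:", same)
```

This only says the two orderings are equally learnable for a dense layer; whichever one you pick has to be used consistently for training and inference.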
What if I have something like this:
conv1 -> flatten -> dense -> dense -> reshape -> conv2
Where conv1’s output shape is (batch, c, h, w) and conv2’s input shape is (batch, h, w, c). Do you think I should permute the dimensions somewhere in between, or do you think the fully connected layers will do the mapping?
A conv2 input shape of (batch, h, w, c) does not make sense to me.
The input and output of a conv layer should both be (batch, c, h, w). If the input is (batch, h, w, c), an error will occur whenever h does not match the layer's in_channels.
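import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 5, kernel_size=3, padding=1)
        # nn.Linear acts on the last dimension only (here W = 5)
        self.fc1 = nn.Linear(5, 10)
        self.fc2 = nn.Linear(10, 5)
        self.conv2 = nn.Conv2d(5, 3, kernel_size=3, padding=1)

    def forward(self, input):
        output = self.conv1(input)   # (1, 5, 5, 5), NCHW
        output = self.fc1(output)    # (1, 5, 5, 10)
        output = self.fc2(output)    # (1, 5, 5, 5)
        output = self.conv2(output)  # (1, 3, 5, 5)
        return output

a = torch.randn(size=(1, 3, 5, 5))
model = Model()
output = model(a)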
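As a rough illustration of the layout issue, the sketch below feeds a channels-last tensor to a conv layer. The layer sizes are arbitrary, and conv2, nhwc and conv_sq are names I made up for this example. nn.Conv2d reads dimension 1 as the channel dimension, so a (batch, h, w, c) input is rejected when h differs from in_channels, and permute(0, 3, 1, 2) brings it back to (batch, c, h, w).

```python
import torch
import torch.nn as nn

conv2 = nn.Conv2d(30, 8, kernel_size=3, padding=1)  # expects (batch, 30, H, W)
nhwc = torch.randn(1, 7, 7, 30)                     # (batch, H, W, C)

try:
    conv2(nhwc)  # dimension 1 has size 7, not 30
    raised = False
except RuntimeError as err:
    raised = True
    print("channels-last input rejected:", err)

# Move the channel dimension to position 1: (batch, H, W, C) -> (batch, C, H, W)
nchw = nhwc.permute(0, 3, 1, 2)
out = conv2(nchw)
print(out.shape)

# Caveat: if H happens to equal in_channels, no error is raised and the
# layer silently treats H as the channel dimension.
conv_sq = nn.Conv2d(5, 3, kernel_size=3, padding=1)
silent = conv_sq(torch.randn(1, 5, 5, 5))
```

The last two lines are the case to watch for: with all sizes equal to 5, a wrong layout runs without complaint.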
Have a look at the YOLO architecture in the picture above.
Notice that there are two dense layers between the front conv layers (output shape: 7x7x1024) and the final output, which is 7x7x30.
I’m trying to implement it in PyTorch, so the output of the layer before the two dense layers is 1024x7x7 (CxHxW).
Then I first reshape/flatten before passing to the two dense layers.
After the second dense layer, I again reshape it to 7x7x30.
Notice I went from 1024x7x7 (CxHxW) to 7x7x30 (HxWxC) (my labels are built using this format). My question is whether this is acceptable. I believe the dense layers in between should handle this mapping from CxHxW to HxWxC. Or should I permute the dimensions somewhere in between so that both are CxHxW (or both HxWxC)?
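For reference, here is a minimal sketch of the flatten -> dense -> dense -> reshape step described above. DetectionHead is a hypothetical name, and the channel count (16 instead of 1024), the hidden width (64), and the leaky ReLU are assumptions chosen to keep the example small; they are not taken from the YOLO paper. Because the dense layers connect every input value to every output value, the output can be viewed directly as (batch, 7, 7, 30) to match HxWxC labels without a permute, as long as the same layout is used everywhere.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Hypothetical head: conv features -> two dense layers -> grid x grid x depth."""

    def __init__(self, in_channels=16, grid=7, depth=30, hidden=64):
        super().__init__()
        self.grid, self.depth = grid, depth
        self.fc1 = nn.Linear(in_channels * grid * grid, hidden)
        self.fc2 = nn.Linear(hidden, grid * grid * depth)

    def forward(self, features):
        x = features.flatten(start_dim=1)              # (batch, C*H*W)
        x = nn.functional.leaky_relu(self.fc1(x), 0.1)
        x = self.fc2(x)                                # (batch, grid*grid*depth)
        # Interpret the dense output as (batch, H, W, C) to match the labels
        return x.view(-1, self.grid, self.grid, self.depth)

features = torch.randn(4, 16, 7, 7)  # stand-in for the conv output, (batch, C, H, W)
head = DetectionHead()
pred = head(features)
labels = torch.zeros(4, 7, 7, 30)    # placeholder labels in HxWxC layout
loss = nn.functional.mse_loss(pred, labels)
print(pred.shape, loss.item())
```

If a later conv layer had to consume this output, it would need pred.permute(0, 3, 1, 2) first; for a loss computed against HxWxC labels, no permute is needed.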
I’m sorry, I have not run into this situation.
But if your images and labels are loaded by a data loader, both default to NCHW format, so we do not need to worry about the data format or do additional operations.
Thanks for your quick reply. But the output of the network is not an image. It is supposed to be a combination of multi-class classification and regression outputs: object detection probabilities and bounding box values.