As I understand it, in fully convolutional architectures the output is a feature map with the same spatial dimensions as the input image.
For example, in YOLO the output has dimensions H x W x D, where the depth D depends on the number of anchor boxes and the number of classes.
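To make the shape concrete, here is a minimal sketch of a dense prediction head. It assumes the YOLOv2/v3-style layout, in which each anchor predicts 4 box values, 1 objectness score, and the class scores; the anchor count, channel count, and 13 x 13 grid are hypothetical values picked only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical values chosen only for illustration.
num_anchors, num_classes = 2, 3
# Assumed YOLOv2/v3-style layout: per anchor, 4 box values + 1 objectness + class scores.
depth = num_anchors * (5 + num_classes)

# Stand-in for a backbone feature map of shape (N, C, H, W).
features = torch.randn(1, 64, 13, 13)

# A 1x1 convolution produces one prediction vector per grid cell.
head = nn.Conv2d(64, depth, kernel_size=1)
out = head(features)

print(out.shape)  # torch.Size([1, 16, 13, 13])
```

Note that PyTorch stores the result as (N, D, H, W), so the per-cell vector lies along the channel axis.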
With that in mind, I decided to write my own version of a fully convolutional architecture that classifies whether an object in the image belongs to one of three classes: C1, C2, or C3.
I would like to know whether the approach below is correct and, if not, what can be done to improve it.

import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class FCNN2(nn.Module):
    def __init__(self):
        super(FCNN2, self).__init__()
        # Learnable layers (kaiming_normal_ replaces the deprecated kaiming_normal)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv1.weight)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv2.weight)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv3.weight)
        self.deconv = nn.ConvTranspose2d(in_channels=16, out_channels=16, kernel_size=3,
                                         stride=2, padding=1, output_padding=1)
        nn.init.kaiming_normal_(self.deconv.weight)
        self.conv4 = nn.Conv2d(in_channels=16, out_channels=3, kernel_size=5, padding=2)
        nn.init.kaiming_normal_(self.conv4.weight)
        # Global average pooling: collapses each channel's H x W map to 1 x 1
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))

    def forward(self, x):
        # Shapes below assume even H and W.
        # x.size() = (N, 3, H, W)
        x = F.relu(self.conv1(x))
        # x.size() = (N, 16, H, W)
        x = F.relu(self.conv2(x))
        # x.size() = (N, 32, H, W)
        x = F.max_pool2d(x, (2, 2))
        # x.size() = (N, 32, H/2, W/2)
        x = F.relu(self.conv3(x))
        # x.size() = (N, 16, H/2, W/2)
        x = self.deconv(x)
        # x.size() = (N, 16, H, W)
        x = self.conv4(x)
        # x.size() = (N, 3, H, W)
        x = self.avg_pool(x)
        # x.size() = (N, 3, 1, 1)
        return x


model = FCNN2()
img = cv2.imread('model_152_3.jpg')  # BGR, uint8, shape (H, W, 3)
# img = cv2.resize(img, (224, 224))
img = img[:, :, ::-1]  # BGR -> RGB
img = img.astype(np.float32) / 255  # scale pixel values to [0, 1]
img = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W)
x = model(img)
print(x.shape)  # torch.Size([1, 3, 1, 1])

Apart from the architecture, I am confused about what the output shape (1, 3, 1, 1) means. Does it imply that the final output is a feature vector of dimension 3x1?
I ask because in standard anchor-based object detection algorithms, each location of the feature map is associated with an N-dimensional feature vector that carries information about bounding box coordinates and classes.
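One way to inspect what a (1, 3, 1, 1) output holds is sketched below. The random `score_map` is only a stand-in for the output of `conv4` (its 8 x 8 size is arbitrary), and applying softmax afterwards is an assumption about how the three scores would be used, not something the model above already does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the conv4 output: 3 class channels over an H x W map (size is arbitrary).
score_map = torch.randn(1, 3, 8, 8)

# Global average pooling: each channel is averaged over all spatial positions.
pooled = nn.AdaptiveAvgPool2d((1, 1))(score_map)   # shape (1, 3, 1, 1)

# Dropping the two singleton spatial axes leaves one score per class.
logits = torch.flatten(pooled, start_dim=1)        # shape (1, 3)

# Assumption: the scores are treated as logits and turned into class probabilities.
probs = F.softmax(logits, dim=1)

# The pooled value equals the spatial mean of each channel.
assert torch.allclose(logits, score_map.mean(dim=(2, 3)), atol=1e-6)

print(pooled.shape, logits.shape)  # torch.Size([1, 3, 1, 1]) torch.Size([1, 3])
```

Read this way, the output is a single 3-element vector for the whole image, whereas a detection head keeps the H x W grid and therefore has one such vector per location.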