Fully convolutional architecture for classification purposes

As I understand it, a fully convolutional architecture outputs a feature map with the same spatial dimensions as the input image.
For example, in YOLO the output has dimensions H x W x [number of anchor boxes + classes].

With that in mind, I decided to write my own fully convolutional architecture that classifies whether an object in the image belongs to one of three classes: C1, C2, or C3.

I would like to know whether the approach below is correct and, if not, how it can be improved.

import torch.nn as nn
import torch.nn.functional as F

class FCNN2(nn.Module):

    def __init__(self):
        super(FCNN2, self).__init__()
        # Learnable layers
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=3, padding=1)
        self.deconv = nn.ConvTranspose2d(in_channels=16, out_channels=16, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv4 = nn.Conv2d(in_channels=16, out_channels=3, kernel_size=5, padding=2)        

    def forward(self, x):
        # x.size() = (N, 3, W, W) 
        x = F.relu(self.conv1(x)) 
        # x.size() = (N, 16, W, W) 
        x = F.relu(self.conv2(x))
        # x.size() = (N, 32, W, W) 
        x = F.max_pool2d(x, (2,2))
        # x.size() = (N, 32, W/2, W/2)
        x = F.relu(self.conv3(x))
        # x.size() = (N, 16, W/2, W/2)
        x = self.deconv(x)
        # x.size() = (N, 16, W, W)
        x = self.conv4(x)
        # x.size() = (N, 3, W, W)
        return x

Apart from the architecture, I am confused about what the output shape (1,3,1,1) means. Does it imply that the final output is a feature vector of dimension 3x1?
I ask because in standard anchor-based object detection algorithms, each location of the feature map is associated with an N-dimensional feature vector that encodes the bounding box coordinates and classes.

I think your understanding of FCNs is correct. Note, however, that a standard FCN also needs deconvolution (transposed convolution) and bilinear upsampling in its decoder to recover the original spatial size. YOLO, for example, has no such decoder, so its output H x W is proportional to the input size rather than equal to it (something like input_H/16 x input_W/16).
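To make the "proportional to the input size" point concrete, here is a minimal sketch of a decoder-free fully convolutional stack. The helper name make_backbone, the channel widths, and the choice of four stride-2 convolutions are assumptions picked for illustration, not YOLO's actual layers; they only demonstrate how an overall stride of 16 produces an output grid of roughly input_H/16 x input_W/16.

```python
import torch
import torch.nn as nn


def make_backbone(out_channels: int) -> nn.Sequential:
    """Hypothetical decoder-free stack: four stride-2 convs, overall stride 16."""
    channels = [3, 16, 32, 64, out_channels]  # illustrative widths only
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        # kernel_size=3, stride=2, padding=1 halves H and W (for even sizes)
        layers.append(nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers[:-1])  # no ReLU after the prediction layer


backbone = make_backbone(out_channels=8)  # 8 is an arbitrary per-cell vector size

small = backbone(torch.randn(1, 3, 64, 64))
large = backbone(torch.randn(1, 3, 128, 96))

# The output grid scales with the input: each cell carries an 8-dim vector.
print(small.shape)  # spatial size 64/16 x 64/16
print(large.shape)  # spatial size 128/16 x 96/16
```

Because there are no fully connected layers, the same weights accept both input sizes; only the size of the output grid changes.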

The output of your model is a feature map with spatial size 1x1; the 3 is the number of channels. PyTorch image tensors use the layout batch x channel x H x W. So for classification you can simply index away the spatial dimensions with out = x[:, :, 0, 0], which gives a 1 x 3 tensor holding the predicted logits for each class.
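As a rough sketch of that last step, the snippet below uses random tensors as stand-ins for the model output. The second half, averaging over the spatial dimensions when the map is larger than 1x1, is my own assumption about how one might handle that case and is only one of several options.

```python
import torch

# Stand-in for a model output of shape (batch, channels, H, W) = (1, 3, 1, 1).
x = torch.randn(1, 3, 1, 1)

# Index away the 1x1 spatial dimensions to get per-class logits.
logits = x[:, :, 0, 0]
print(logits.shape)  # (1, 3)

# flatten gives the same values for a 1x1 map.
flat = x.flatten(start_dim=1)
print(torch.equal(logits, flat))

# Predicted class index per sample.
pred = logits.argmax(dim=1)
print(pred.shape)  # (1,)

# Assumption: if the map is larger than 1x1, global average pooling
# over H and W is one way to reduce it to a single logit vector.
fmap = torch.randn(1, 3, 8, 8)
pooled = fmap.mean(dim=(2, 3))
print(pooled.shape)  # (1, 3)
```

The resulting (N, 3) logits can be passed directly to nn.CrossEntropyLoss, which applies the softmax internally.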