Hello,
I am new to object detection. I was reading the CenterNet paper, “CenterNet: Keypoint Triplets for Object Detection”, and I was left with a few questions.
- I am familiar with CNNs and the general principle of image classification: convolution, batch norm, and max pooling, then flattening the features, then sigmoid/softmax classification. However, from what I understand, object detection uses an autoencoder-style backbone, right?
- Are the so-called keypoints an output of the encoder-decoder backbone, or is the output just an image reconstructed by the decoder?
- Are there any coordinates in the output of the decoder?
- Are the heatmaps, center pooling, and offsets purely an image segmentation process?
- Is the loss function always pixel-wise? Is the ground truth the original image, or is it the label of the original image?
Thank you in advance for your time and help.