CenterNet Network Output


I am new to object detection. I was reading the CenterNet paper, “CenterNet: Keypoint Triplets for Object Detection”, and it left me with some questions.

  1. I am familiar with CNNs and the general pipeline: convolution, batch normalization, and max pooling, then flattening the features, then sigmoid/softmax classification. However, from what I understand, object detection adopts an autoencoder-style (encoder-decoder) backbone, right?

  2. Are the so-called keypoints an output of the encoder-decoder backbone, or is the output just an image reconstructed by the decoder?

  3. Are there any coordinates among the outputs of the decoder?

  4. Are the heatmaps, center pooling, and offsets purely an image segmentation process?

  5. Is the loss function always pixel-wise? Is the ground truth the original image, or is it the label of the original image?
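
To make questions 3 to 5 concrete, here is a minimal NumPy sketch of how I currently picture it; it is not taken from the paper's code and may well be wrong. It assumes the ground truth is a heatmap built from the box label (a Gaussian bump at the box center), and that coordinates are recovered afterwards from the heatmap peak plus a sub-pixel offset. The function names `gaussian_heatmap` and `decode_peak`, the stride of 4, the sigma, and the example box are all hypothetical choices for illustration.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap with a Gaussian bump centered on the cell (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_peak(heatmap, offset=(0.0, 0.0), stride=4):
    """Map the strongest heatmap response back to input-image coordinates."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (x + offset[0]) * stride, (y + offset[1]) * stride

# Hypothetical label: one box (x1, y1, x2, y2) in a 128x128 input image.
box = (41, 24, 88, 72)
stride = 4  # assumed downsampling factor between input and heatmap

# Box center in input-image pixels, then in heatmap cells.
cx_full = (box[0] + box[2]) / 2
cy_full = (box[1] + box[3]) / 2
cx, cy = int(cx_full // stride), int(cy_full // stride)

# The "ground truth" here comes from the label, not from the image itself.
target = gaussian_heatmap(32, 32, cx, cy)

# The offset stores the fraction lost when rounding down to a heatmap cell.
offset = (cx_full / stride - cx, cy_full / stride - cy)

print(decode_peak(target, offset, stride))  # (64.5, 48.0)
```

If this picture is right, the network would never output coordinates directly; they would be read off the predicted heatmaps and offsets in a post-processing step, and a pixel-wise loss would compare predicted and target heatmaps rather than images.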

Thank you in advance for your time and help.