Pretrained keypoint detection model is outputting too many tensors

I am trying to use the pretrained Keypoint R-CNN model to detect poses and make use of the pose results, but the keypoints output is a tensor of shape (2, 17, 3), where 17 is the number of joints and 3 is the coordinates plus visibility. What is the 2 for? I assumed it is the number of persons in the image, but my image has only one person, so why is it 2?? I want to use the pose data but I am confused about which detection is the correct one.
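For context, a minimal sketch of how this can be handled, using a synthetic stand-in for one element of the list that torchvision's `keypointrcnn_resnet50_fpn` returns in eval mode (the tensor values here are made up for illustration; real outputs come from `model([image])`). The first dimension counts detections, which can exceed the number of people because low-confidence or duplicate detections are also returned, so one common approach is to filter by the per-detection `scores`:

```python
import torch

# Synthetic stand-in for one output dict of torchvision's Keypoint R-CNN
# (values invented for illustration):
# "keypoints": (N, 17, 3) -> N detections, 17 COCO joints, (x, y, visibility)
# "scores":    (N,)       -> one confidence score per detection
output = {
    "keypoints": torch.rand(2, 17, 3),
    "scores": torch.tensor([0.98, 0.31]),  # second detection is low confidence
}

# N can be larger than the number of people in the image, so keep
# only confident detections, e.g. score > 0.9.
keep = output["scores"] > 0.9
keypoints = output["keypoints"][keep]

print(keypoints.shape)  # → torch.Size([1, 17, 3])
```

With a threshold like 0.9, only the high-confidence detection survives, which is usually the "correct" pose for a single-person image.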

For the input to the network, what have you passed as the class labels in target["labels"]?
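For reference, a sketch of what a training target for torchvision's Keypoint R-CNN typically looks like for a single-person image (the box and keypoint values are placeholders; for person keypoint detection the only foreground class label is 1, since 0 is reserved for background):

```python
import torch

# Hypothetical training target for one image with a single person
# (coordinates are placeholders, not real annotations):
target = {
    "boxes": torch.tensor([[50.0, 40.0, 200.0, 300.0]]),  # (num_people, 4), xyxy format
    "labels": torch.tensor([1], dtype=torch.int64),       # one "person" label per box
    "keypoints": torch.rand(1, 17, 3),                    # (num_people, 17, (x, y, visibility))
}
```

Each tensor's first dimension must agree: one box, one label, and one set of 17 keypoints per annotated person.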