Hello!

I’m sorry if this is kind of a noob question, but I am unable to find an answer to this anywhere.

I want to train a pose detection network that outputs a confidence score and the 3D coordinates of specific landmarks. The training data provides both labels as well: a flag indicating whether something detectable is in the image and, if so, the 3D coordinates that should be predicted.

I now want to calculate the loss for these outputs. My thinking was that I need a binary classification loss for the confidence and a regression loss for the coordinates. However, when nothing detectable is present, I obviously do not want any gradients for the coordinates, since there is no correct way to change them, but I do want gradients for the confidence that something is present. Is it okay to just mask out the entries of the keypoint tensor where the confidence should be 0? Does this still calculate the right gradients, or is it an entirely wrong approach?

To visualize this better, here is a code snippet of what the loss calculation would look like:

```
import torch.nn as nn


class CoordinateLoss(nn.Module):
    def forward(self, prediction, target):
        confidence_loss_fn = nn.BCELoss()
        keypoint_loss_fn = nn.MSELoss()
        # Samples where something detectable is present (flag == 1)
        detection_possible_mask = target[:, 0] == 1
        confidence_loss = confidence_loss_fn(
            prediction[0][:, 0], target[:, 0].float())
        # Only compute the regression loss on samples with valid coordinates
        keypoint_loss = keypoint_loss_fn(
            prediction[2][detection_possible_mask],
            target[:, 2:][detection_possible_mask].float())
        return confidence_loss + keypoint_loss
```
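
For context, here is a small standalone check I sketched (with made-up tensor shapes and values, not my real model) to see whether boolean-mask indexing really keeps gradients away from the samples where nothing is detectable:

```python
import torch
import torch.nn as nn

# Fake batch of 4 samples; target layout mirrors mine: [flag, unused, x, y, z]
torch.manual_seed(0)
predicted_coords = torch.randn(4, 3, requires_grad=True)
target = torch.tensor([
    [1.0, 0.0, 0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # nothing detectable in this sample
    [1.0, 0.0, 0.4, 0.5, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # nothing detectable in this sample
])

mask = target[:, 0] == 1
loss = nn.MSELoss()(predicted_coords[mask], target[:, 2:][mask])
loss.backward()

# Rows that were masked out should have received zero gradient
print(predicted_coords.grad)
```

If I read the output correctly, the gradient rows for the masked-out samples are all zeros, which is what I would hope for, but I would still like confirmation that this is a sound approach.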

Thanks a lot for your help!