I’m training a convolutional neural network to recognize features in an image, using a method similar to this paper: Stacked Hourglass
The targets I am using are artificial heatmaps generated from the feature (x, y) locations. I use heatmaps rather than single points to make the training more forgiving: all I really need is for the heatmap’s maximum to land near the feature location.
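To illustrate, a single target channel looks roughly like this (a minimal sketch only; the unnormalized Gaussian and the `sigma` value here are just illustrative, not the exact targets I generate):

```python
import torch

def gaussian_heatmap(loc, height, width, sigma=2.0):
    """Unnormalized Gaussian blob centered at loc = (y, x)."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)  # [H, 1]
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)   # [1, W]
    y0, x0 = loc
    return torch.exp(-((ys - y0) ** 2 + (xs - x0) ** 2) / (2 * sigma ** 2))

# e.g. a 64x64 target whose maximum sits at (y=20, x=30)
target = gaussian_heatmap((20.0, 30.0), 64, 64)
```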
However, the network is learning to reproduce the heatmap itself, so I end up with a blobby-looking output that isn’t recognizing what I actually want it to.
What I want to try instead is training the model on the L2 distance between the feature location and the location of the maximum value in the network’s output.
What I need is help converting the network output to the right format.
```
input_image = [B, C, H, W]
target      = [B, C, loc]    # where loc = [y, x]
```
I can calculate the maximum no problem…
```python
import torch

def maximum(tensor):
    """Find the (y, x) location of the per-channel maximum: [B, C, H, W] -> [B, C, 2]."""
    assert tensor.dim() == 4, "Tensor must be of size 4 [BxCxHxW]"
    batches = []
    for b in tensor:                  # iterate over the batch
        joints = []
        for c in b:                   # iterate over the channels
            h, w = c.size()
            maxValue = c[0, 0]
            ymax, xmax = 0, 0
            for y in range(h):
                for x in range(w):
                    value = c[y, x]
                    if value > maxValue:
                        maxValue = value
                        ymax, xmax = y, x
            joints.append([ymax, xmax])
        batches.append(joints)
    return torch.Tensor(batches)      # fresh tensor: detached from the graph
```
But that function breaks backpropagation (I get this error):
```
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```
Therefore I need to use the `torch.max()` function, but I can’t for the life of me figure out how, since it only calculates maximums over one dimension at a time…
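For reference, the hard maximum itself can at least be vectorized by flattening H and W into one dimension before taking the argmax. This replaces the Python loops above, but the result is still integer indices, so it doesn’t solve the gradient problem:

```python
import torch

def hard_argmax(tensor):
    """[B, C, H, W] -> [B, C, 2] of integer (y, x) indices. Not differentiable."""
    B, C, H, W = tensor.shape
    flat_idx = tensor.view(B, C, -1).argmax(dim=-1)            # [B, C]
    return torch.stack((flat_idx // W, flat_idx % W), dim=-1)  # [B, C, 2]
```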
So, how do I convert `[B, C, H, W]` to `[B, C, 2]` while maintaining the gradient, so the loss function and optimizer can operate normally?
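For concreteness, here is the kind of differentiable relaxation I have in mind (a minimal sketch, not something I have verified: a “soft-argmax” that takes a spatial softmax over each channel and then the expected (y, x) coordinate under it; the sharpness factor `beta` is an arbitrary placeholder):

```python
import torch
import torch.nn.functional as F

def soft_argmax(heatmaps, beta=100.0):
    """Differentiable [B, C, H, W] -> [B, C, 2] of expected (y, x) coordinates.

    beta sharpens the softmax so the expectation concentrates near the true
    maximum; gradients flow through softmax, multiplication and summation.
    """
    B, C, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.view(B, C, -1) * beta, dim=-1)  # [B, C, H*W]
    probs = probs.view(B, C, H, W)
    ys = torch.arange(H, dtype=heatmaps.dtype, device=heatmaps.device)
    xs = torch.arange(W, dtype=heatmaps.dtype, device=heatmaps.device)
    expected_y = (probs.sum(dim=3) * ys).sum(dim=2)            # [B, C]
    expected_x = (probs.sum(dim=2) * xs).sum(dim=2)            # [B, C]
    return torch.stack((expected_y, expected_x), dim=-1)       # [B, C, 2]
```

With something like this, the loss would presumably just be `F.mse_loss(soft_argmax(output), target)`, with `target` as float coordinates of shape `[B, C, 2]`.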
Ideally I’d also like the model to be exportable to ONNX, and hopefully from there to CoreML.
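As far as I can tell, an approach built only from `softmax`, element-wise multiplication and `sum` (like the sketch above) should export cleanly. A minimal export sketch, with a stand-in module and placeholder shapes and filename:

```python
import torch
import torch.nn as nn

# stand-in for the real network; anything producing [B, C, H, W] heatmaps works here
model = nn.Conv2d(3, 16, kernel_size=3, padding=1)
dummy = torch.randn(1, 3, 256, 256)  # placeholder input shape

torch.onnx.export(model, dummy, "heatmap_model.onnx", opset_version=11)
```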