A More Forgiving Vision NN Model?


I’m training a convolutional neural network to recognize features in an image, using a method similar to the one in the Stacked Hourglass paper.

The targets I am using are artificial heatmaps generated from the feature (x, y) locations. The reason I am using heatmaps instead of single points is that I want the training to be more forgiving. I just need the heatmap’s maximum value near the feature location.

However, the network is training itself TO the heatmap, so I end up with a blobby-looking output that doesn’t recognize what I want it to.

What I want to try instead is training the model on the L2 distance between the feature location and the location of the maximum value in the convolutional network’s output.

What I need is help converting the network output to the right format.

input_image = [B, C, H, W]
target = [B, C, loc] # where loc = [y, x]

I can calculate the maximum no problem…

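import torch

def maximum(tensor):
    assert tensor.dim() == 4, "Tensor must have 4 dimensions [BxCxHxW]"

    batches = []
    for b in tensor:
        joints = []
        for c in b:
            s = c.size()
            maxValue = float("-inf")  # start below any value so negative maxima are found
            ymax = 0
            xmax = 0
            for y in range(s[0]):
                for x in range(s[1]):
                    value = c[y, x]
                    if value > maxValue:
                        maxValue = value
                        ymax = y
                        xmax = x
            joints.append([ymax, xmax])  # (y, x) of this channel's maximum
        batches.append(joints)
    # Built from plain Python ints, so the result is not connected to the autograd graph
    maxT = torch.Tensor(batches)
    return maxT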

But that function breaks backpropagation. I get this error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Therefore I need to use the torch.max() function, but I can’t for the life of me figure out how to do so, since it calculates maximums over only one dimension at a time…

So, how do I convert [B, C, H, W] to [B, C, 2] but maintain the gradient so the loss function and optimizer can operate normally?

Ideally I’d like for the model to also be exportable to ONNX and hopefully then to CoreML.


Two quick comments:

  • Use torch.max(inp.view(B, C, -1), dim=2) to get the flat index of each maximum, then numpy.unravel_index (or just // and %) to turn it into (y, x).
  • You cannot usefully differentiate argmax: it is piecewise constant, so its gradient is zero wherever it is defined. People have suggested various workarounds, e.g. softargmax or argsoftmax (I cannot remember which), defined as (torch.arange(W * H).view(1, 1, W * H) * torch.softmax(inp.view(B, C, -1), dim=2)).sum(dim=2) or somesuch. It’s not always entirely clear how to interpret that, but it seems that some people like it.
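To make both comments concrete, here is a minimal sketch, not a tested recipe for this particular model. The function names hard_argmax_2d and soft_argmax_2d and the temperature parameter are made up for illustration. The soft version computes the expected (y, x) under a spatial softmax, which is one common reading of "soft-argmax"; it only approximates the true argmax when the heatmap has a single sharp peak. Export to ONNX or CoreML is untested here.

```python
import torch


def hard_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
    """[B, C, H, W] -> [B, C, 2] integer (y, x) of each channel's maximum.

    Integer indices carry no gradient, so this suits evaluation, not a loss.
    """
    B, C, H, W = heatmaps.shape
    _, flat_idx = torch.max(heatmaps.reshape(B, C, -1), dim=2)  # flat_idx: [B, C]
    ys = torch.div(flat_idx, W, rounding_mode="floor")  # row index
    xs = flat_idx % W                                   # column index
    return torch.stack((ys, xs), dim=-1)


def soft_argmax_2d(heatmaps: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """[B, C, H, W] -> [B, C, 2] expected (y, x) under a spatial softmax.

    `temperature` is an illustrative knob: smaller values sharpen the softmax
    and move the result closer to the hard argmax.
    """
    B, C, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(B, C, -1) / temperature, dim=2)
    probs = probs.reshape(B, C, H, W)
    ys = torch.arange(H, dtype=heatmaps.dtype, device=heatmaps.device)
    xs = torch.arange(W, dtype=heatmaps.dtype, device=heatmaps.device)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)  # marginal over columns, then mean row
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)  # marginal over rows, then mean column
    return torch.stack((exp_y, exp_x), dim=-1)


# Usage sketch: L2 loss between predicted and target (y, x) locations.
output = torch.randn(2, 16, 64, 64, requires_grad=True)  # stand-in for the network output
target = torch.rand(2, 16, 2) * 63                        # hypothetical [B, C, (y, x)] targets
loss = ((soft_argmax_2d(output) - target) ** 2).sum(dim=-1).mean()
loss.backward()  # gradients reach `output` because every step is a tensor op
```

Computing the two marginals separately avoids the ambiguity of averaging a flat index over W * H positions, where a distribution split between two rows could average to a location in neither.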

Best regards


Thank you for your comments.

I’m new to ML (background is as a programmer) so this helps point me in the right direction and understand the model’s limits.