Implicit tensor creation for autograd use

I have a bit of an odd question that requires some background information first.

EDIT: Google Colab

I’m currently working on research that uses backprop through discrete optimization algorithms like A* for shortest paths, and I have code that works perfectly for that. I’ll give short code examples, but can see about getting a complete example uploaded later if required.

Essentially, a network takes in a picture and outputs an NxN grid of values, where each value is the cost of traversing into that tile. These tile costs, together with given start/goal location indices, are then used in a shortest-path algorithm. To track gradients through the shortest-path algorithm, the solver is run twice, with some noise added on the second run. The output of the shortest-path solver is a binary mask of the path to travel.

def a_star(w, start, goal):
    ...

class ShortestPath(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, start, goal):
        path_mask = a_star(w, start, goal)
        # save ctx values (w, start, goal, path_mask, ...) for backward
        return path_mask

    @staticmethod
    def backward(ctx, grad_output):
        # get saved values from ctx ...
        w_prime = w + grad_output * eps     # wiggle w with specific noise
        path_mask_prime = a_star(w_prime, start, goal)
        gradient = -(path_mask - path_mask_prime)
        return gradient, None, None

# running example
predicted_w = network(img)   # NxN tile costs, used as weights in A*
predicted_path = ShortestPath.apply(predicted_w, start, goal)
... 
loss = HammingLoss(predicted_path, actual_path_mask)
...
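
For context, here is a minimal sketch of what I mean by HammingLoss on the two binary masks (assuming it just measures element-wise mismatch; hamming_loss below is only an illustration, not the exact implementation):

def hamming_loss(predicted_mask, target_mask):
    # fraction of tiles where the predicted and target binary masks disagree
    mismatch = predicted_mask * (1.0 - target_mask) + (1.0 - predicted_mask) * target_mask
    return mismatch.mean()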

This works fine and is a method that follows a previous paper. I have results which match theirs, so I know this works.

Now, suppose that we are working with a large pathfinding problem, i.e., computing all of predicted_w might be wasteful, or maybe we want to use a more complicated network. Since the gradient of ShortestPath is a difference of binary masks, we really only need to compute the tile costs that are relevant to our search, not the costs for the entire space (the gradient is 0 for tiles that are not part of the solution path in either of the two runs of the algorithm).
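
A quick way to see that sparsity (illustrative sketch only; path_mask and path_mask_prime are the two solver outputs from backward above):

# -(path_mask - path_mask_prime) is nonzero only where the two masks disagree,
# i.e. only on tiles that lie on one of the two computed paths
changed = path_mask != path_mask_prime
print(changed.sum().item(), "of", changed.numel(), "tiles carry a nonzero gradient")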

What I would like to do is initialize predicted_w as a zero tensor, have the network output single cost values on the fly inside my a_star, and set the corresponding index of predicted_w from each output tensor, so that the gradients from the network are tracked (each index of predicted_w would come from a separate output of the network).

def a_star(w, start, goal, network, noise):
    ...
    # query the network on the fly for the cost of the tile being expanded
    tile_cost = network(img, ...)
    w[x, y] = tile_cost
    ...

class ShortestPath(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, start, goal, network):
        path_mask = a_star(w, start, goal, network, None)
        # save ctx values ...
        return path_mask

    @staticmethod
    def backward(ctx, grad_output):
        # get saved values from ctx ...
        noise = grad_output * eps     # wiggle w with specific noise
        path_mask_prime = a_star(w, start, goal, network, noise)
        gradient = -(path_mask - path_mask_prime)
        return gradient, None, None, None

# running example
predicted_w = torch.zeros(N, N)   # filled in on the fly inside a_star
predicted_path = ShortestPath.apply(predicted_w, start, goal, network)
...

This sets the correct values in predicted_w, but when I run loss.backward(), I get gradient values of None for the layers of my network. Is what I am trying to do possible? Again, I can try to put together a running gist if that would help.
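
To be concrete about what I mean by "gradient values of None", this is roughly the check I run after the backward pass (sketch, using the names from the snippets above):

loss.backward()
for name, param in network.named_parameters():
    print(name, param.grad)    # every parameter prints None in the second setup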

Here is a Colab link with the output for two tests at the bottom. One test is the normal way of running, and the second uses the implicit tensor calculation. As you can see, the gradients aren’t being tracked in the second test.