Inplace operation RuntimeError caused by loss defined in this function

Hi,

I am getting the following error, and torch.autograd.set_detect_anomaly(True) is not able to give me the location of the error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512, 65]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The function below is causing the problem, because as soon as I add the loss defined in this function to the rest of the losses I am optimizing for, it throws the error. Does anyone know if I have an in-place operation here?

I am optimizing a few times (it’s a sequential process where I optimize every t iterations) and I call loss.backward(retain_graph=True). It specifically complains the second time I call backward.

The network I am backpropagating through outputs the variable canvas.

Thanks!

    def calculate_random_windows_loss(self, canvas, target, windows):
        """
        Calculate error maps between canvas and target.
        windows: [batch, num_windows, 3, 128, 128]
        canvas and target: [batch, 3, 128, 128]
        """
        # 1) Unsqueeze and expand canvas and target to match the shape of windows
        num_windows = windows.shape[1]
        nonzero = torch.count_nonzero(windows, dim=(2,3,4))
        canvas_ = canvas.unsqueeze(1).expand(-1, num_windows, -1, -1, -1) # same shape as windows
        target_ = target.unsqueeze(1).expand(-1, num_windows, -1, -1, -1)
        
        # 2) Calculate error maps -> sum across channels (2nd dim) and pixels (3rd and 4th dim),
        #    normalized by the number of nonzero elements per window
        error_maps = torch.nn.functional.mse_loss(canvas_ * windows, target_ * windows, reduction='none').sum((2,3,4)) / nonzero  # [batch, num_windows]
        # topk with k=1 selects the window with the largest error for each batch element
        window_loss, window_idx = torch.topk(error_maps, k=1)
        
        return window_loss, window_idx

Are you sure you actually need retain_graph=True? I’m asking because that error message tempts many users into enabling it prematurely.

Best regards

Thomas

Hi Thomas,

Yes, I need it, otherwise I get the typical “trying to backward through the graph a second time” error. I have a loop where I update the canvas at every timestep (forward pass), while other variables might not change during the loop: the target stays fixed for the entire sequence, and the windows stay fixed for, say, 5 timesteps (then I recalculate the windows). But the windows are just masks.

The algorithm works fine without this new loss I am trying to add, but as soon as I add it I get the error, which is very strange.

So I will not insist on this any further in this thread, but my exact observation is that

otherwise I get the typical “trying to backward through the graph a second time” error.

is generally not a good reason to use retain_graph=True. Personally, I always try to articulate it very explicitly: “I need to retain the graph for the calculation of FOO from BAR across training steps because of BAZ”. It seems silly, but more often than not I’ve seen people discover they don’t need retain_graph=True after all. But as you were happy with it, I’ll leave it at this general comment.

So a bit of background: autograd will save inputs and/or outputs of the forward pass in order to compute the backward, but it will only save those actually needed (modulo coding errors, which are unlikely in stock PyTorch functions).
This means that there are two ways of introducing a fresh “modified by an inplace operation” error:

  • The perhaps more obvious one: Introduce a new in-place operation.
  • The perhaps more subtle one: Change the computation of something that is modified in-place (or a computation that uses it) from an operation that doesn’t need its output (or its input, respectively) for the backward to one that does. So in this case it is not the in-place operation that is new, but how you compute what gets changed in-place (or what you compute with it); see the sketch below.

Usually, the anomaly detection should give you a hint; in particular, if it points to a new line of code, it could be the second case.
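
To make the second bullet concrete, here is a minimal, self-contained sketch (it has nothing to do with your model): the in-place x += 1 is identical in both variants, but only exp saves its output for the backward, so only that variant hits the in-place error.

    import torch

    w = torch.randn(3, requires_grad=True)

    # Variant A: multiplying by a constant does not save its output for backward.
    x = w * 2
    x += 1                  # in-place, but nothing that autograd saved is touched
    x.sum().backward()      # works

    # Variant B: exp saves its output for backward (grad = grad_out * output).
    y = w.exp()
    y += 1                  # in-place modification of a tensor autograd saved
    y.sum().backward()      # RuntimeError: ... modified by an inplace operation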

Best regards

Thomas

Hi Thomas,

Sure, I understand the general concern about knowing when to use retain_graph=True. I agree with you. I skipped the explanation of why I need it because I think it’s irrelevant to the current problem (I’ve been working on the same code for a while, with satisfactory results, and just encountered this error when adding a new custom loss). But thanks for your emphasis on the retain_graph=True issue; I think your logic should always be applied.

I think my problem is more likely your second bullet point.

One question that could potentially be related to the problem I have:

The current setup I have works like a recurrent neural network. The same network is called T times (say T=50), and the output of the network at time t (canvas_t) is the input of the network at time t+1 (concatenated with the target image, which stays the same for the whole sequence). Whenever t % k == 0 (k is a hyperparameter, say k=10), I call backward on the losses accumulated at each t and perform an optimization step. If I detach the gradient of the output canvas_t after feeding it into the same network in the following time step, how is the network still able to backpropagate the gradients?
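
To make sure we are talking about the same structure, here is a rough, self-contained sketch of that loop (net, T, k and the loss are simplified stand-ins, not my actual code):

    import torch
    import torch.nn as nn

    T, k = 50, 10
    net = nn.Conv2d(6, 3, kernel_size=3, padding=1)          # stand-in for my network
    optimizer = torch.optim.Adam(net.parameters())
    target = torch.rand(1, 3, 128, 128)
    canvas = torch.zeros_like(target)
    losses = []

    for t in range(T):
        canvas = net(torch.cat([canvas, target], dim=1))     # canvas_t feeds step t+1
        losses.append(((canvas - target) ** 2).mean())       # stand-in for my cumulative losses
        if (t + 1) % k == 0:
            optimizer.zero_grad()
            sum(losses).backward(retain_graph=True)          # as in my current code
            optimizer.step()
            losses = []
        # canvas = canvas.detach()   # <- my question: if I detach here, how do the
        #                            #    gradients still reach the network weights?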

Thanks,

What seems to be typically done here, under the name truncated backpropagation through time (truncated BPTT), is to “detach everything” that is carried over between the backward calls.
Part of the reason is the exploding/vanishing gradient problem, and the other part is the administrative component we are discussing here. :slight_smile:
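
Concretely, a truncated-BPTT version of a loop like yours could look roughly like this (again a self-contained sketch with made-up names, not a drop-in replacement for your training loop):

    import torch
    import torch.nn as nn

    T, k = 50, 10
    net = nn.Conv2d(6, 3, kernel_size=3, padding=1)      # stand-in network
    optimizer = torch.optim.Adam(net.parameters())
    target = torch.rand(1, 3, 128, 128)
    canvas = torch.zeros_like(target)
    losses = []

    for t in range(T):
        canvas = net(torch.cat([canvas, target], dim=1))
        losses.append(((canvas - target) ** 2).mean())
        if (t + 1) % k == 0:
            optimizer.zero_grad()
            sum(losses).backward()       # no retain_graph: this chunk's graph is freed
            optimizer.step()
            losses = []
            # Detach the carried-over state so the next chunk starts a fresh graph;
            # gradients still flow through the k network calls inside each chunk.
            canvas = canvas.detach()

The gradient then only ever covers the last k steps, which is exactly the truncation.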

Would you mind explaining why it is good practice to detach everything between the backward calls? How does it work? If explaining that would take a lot of your time, would you mind sharing some links or pointing me to some further reading?

Thanks a lot!