# Gradient computation modified by an inplace operation

Hi, I have a forward method of an encoder, and at the end I want to calculate the Euclidean distance between each pair of sequence elements (like self-attention, but the attention scores are the L2 norm). I have tried for hours, but no matter how I calculate this, autograd is not having it.

``````
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 128, 128]], which is output 0 of SqrtBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
``````

I know that it is this calculation that causes the error, because if I return before it, the error doesn’t occur. Why does this error occur in my calculation?

``````
def forward(self, src, src_pos):
    mem = self.src_embed(src, src_pos)          # (B, N, d_m)

    query = mem.unsqueeze(2)                    # (B, N, 1, d_m)
    key = mem.unsqueeze(1)                      # (B, 1, N, d_m)

    # Compute the squared Euclidean distance between all pairs of positions
    squared_distance = torch.sum((query - key) ** 2, dim=-1)   # (B, N, N)

    # Take the square root to get the Euclidean distance
    euclidean_distance = torch.sqrt(squared_distance)
    return euclidean_distance
``````

Hi Daniel!

The error message you posted gives two useful clues: First, the shape of
the tensor being modified inplace is `[1, 128, 128]`, and second, it is the
output of a `sqrt()` operation.

Such an error could occur in parts of your code that you haven’t posted, but
it could also be happening to `euclidean_distance`, the result of a `sqrt()`
operation. Is the shape of `euclidean_distance` `[1, 128, 128]`?

I would look at what happens to `euclidean_distance` after it is returned
from the `forward()` function you posted. Do you see any inplace operations
being applied to it in the code that is subsequently called?

If this is the issue, then returning a `clone()` of `euclidean_distance`, e.g.,
`return euclidean_distance.clone()`, might be sufficient to fix the problem.
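As an illustration (not your actual code), here is a minimal sketch of this failure mode, using a made-up tensor of the shape from your error message and an inplace `sigmoid_()` standing in for whatever inplace operation is applied downstream:

``````
import torch

x = torch.randn(1, 128, 128, requires_grad=True)

bad = torch.sqrt(x**2 + 1.0)     # sqrt() saves its output for its backward pass
bad.sigmoid_()                   # inplace op bumps the version of that saved output
# bad.sum().backward()           # -> RuntimeError: ... output 0 of SqrtBackward0 ...

good = torch.sqrt(x**2 + 1.0).clone()    # clone() decouples what the caller modifies
good.sigmoid_()                          # the saved sqrt() output stays at version 0
good.sum().backward()                    # backward succeeds
``````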

For further suggestions about how to debug such issues, please see this post:
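As the error message itself hints, it can also help to enable anomaly detection, which makes the backward error additionally print the forward-pass traceback of the operation (here, presumably the `sqrt()`) whose saved tensor was modified:

``````
import torch

# slows execution noticeably, so enable it only while debugging
torch.autograd.set_detect_anomaly(True)
``````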

Best.

K. Frank

Hi Frank,
thanks a lot. You were right: returning `euclidean_distance.clone()` fixes the problem, but I don’t understand why. After it is returned, it only gets modified in a custom loss function (here it is `input`). The problem seems to be the sigmoid + clamping line, which is strange because I used this loss for normal attention and it worked fine.

``````
import torch
from torch import nn


class DistanceLoss(nn.Module):
    '''
    Focal loss for binary classification with or without logits. Targets that are neither exactly 0 nor 1 are ignored.
    Occurrences of the two classes are auto-balanced when using reduction='mean'.
    '''
    def __init__(self, gamma=2, sigmoid_clamp: float = 1e-4, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.sigmoid_clamp = sigmoid_clamp
        self.reduction = reduction

    def forward(self, input, target):
        input_s = torch.clamp(input.sigmoid_(), min=self.sigmoid_clamp, max=1 - self.sigmoid_clamp)
        pos_idx = (target == 1)
        neg_idx = (target == 0)
        pos_loss = torch.pow(1 - input_s[pos_idx], self.gamma) * torch.log(input_s[pos_idx])
        neg_loss = torch.pow(input_s[neg_idx], self.gamma)     * torch.log(1 - input_s[neg_idx])

        if self.reduction == 'mean':
            n_pos = pos_idx.sum()
            n_neg = neg_idx.sum()
            if n_pos == 0:
                return -neg_loss.sum() / n_neg
            else:
                return -pos_loss.sum() / n_pos - neg_loss.sum() / n_neg
        elif self.reduction == 'sum':
            return -pos_loss.sum() - neg_loss.sum()
        return -pos_loss, -neg_loss
``````

Edit: changing `sigmoid_()` to `sigmoid()` fixes the problem without needing `euclidean_distance.clone()`. Still, why isn’t this a problem with normal attention? Another problem now is that the loss is a finite number, but some model parameters are set to NaN after the optimizer step. It is as if the gradient can’t reach them. Could this be related?

``````
query = mem.unsqueeze(2)    # (B, N, 1, d_m)
key = mem.unsqueeze(1)      # (B, 1, N, d_m)
``````

Whether an inplace operation is disallowed (e.g. here the inplace `sigmoid_` usage) depends on the operations involved and on which tensors are needed in their original form for the gradient computation. `torch.sqrt` uses its `result` in its backward, as seen here, and your “normal attention” approach might use other operations that do not depend on the result for the gradient calculation.
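To make that concrete, the out-of-place fix described in the edit above simply leaves the tensor returned by `forward()` (the saved output of `sqrt()`) untouched inside the loss:

``````
# out-of-place sigmoid() returns a new tensor instead of overwriting `input`,
# so the sqrt() output saved for the backward pass keeps its original version
input_s = torch.clamp(input.sigmoid(), min=self.sigmoid_clamp, max=1 - self.sigmoid_clamp)
``````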