Gradient computation modified by an inplace operation

Hi, I have a forward method of an encoder, and at the end I want to calculate the Euclidean distance between each pair of sequence elements (like self-attention, but the attention scores are the L2 distances). I have tried for hours, but no matter how I calculate this, autograd is not having it.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 128, 128]], which is output 0 of SqrtBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I know that it is this calculation that causes the error, because if I return before it, the error doesn’t occur. Why does this error occur in my calculation?

def forward(self, src, src_pos):
    src_mask = (src == -1)
    src = self.src_embed(src, src_pos)
    mem = self.encoder(src, src_key_padding_mask=src_mask)

    query = mem.unsqueeze(2)    # (B, N, 1, d_m)
    key = mem.unsqueeze(1)      # (B, 1, N, d_m)

    # Compute the squared Euclidean distance between every pair of sequence elements
    squared_distance = torch.sum((query - key) ** 2, dim=-1)  # (B, N, N)

    # Take the square root to get the Euclidean distance
    euclidean_distance = torch.sqrt(squared_distance)
    return euclidean_distance
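
For reference, the same (B, N, N) pairwise Euclidean distances could also be computed with torch.cdist; a one-line equivalent (only an equivalent formulation, not itself a fix for the error), assuming mem keeps the shape above:

    euclidean_distance = torch.cdist(mem, mem, p=2)   # (B, N, N) pairwise L2 distances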

Hi Daniel!

The error message you posted gives two useful clues: First, the shape of
the tensor being modified inplace is [1, 128, 128], and second, it is the
output of a sqrt() operation.

Such an error could occur in parts of your code that you haven’t posted, but
it could also be happening to euclidean_distance, the result of a sqrt()
operation. Is the shape of euclidean_distance [1, 128, 128]?

I would look at what happens to euclidean_distance after it is returned
from the forward() function you posted. Do you see any inplace operations
being applied to it in the code that is subsequently called?

If this is the issue, then returning a clone() of euclidean_distance, e.g.,
return euclidean_distance.clone(), might be sufficient to fix the problem.
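
For example, a minimal sketch of that change at the end of the forward() you posted:

    euclidean_distance = torch.sqrt(squared_distance)
    # clone() hands the caller a copy, so sqrt()'s saved output stays at version 0
    return euclidean_distance.clone()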

For further suggestions about how to debug such issues, please see this post:
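
In the meantime, the hint in the error message itself is worth using: anomaly detection makes the backward error print the forward operation that created the failing tensor. A minimal sketch (model, criterion, and target are placeholders for your own training code):

    torch.autograd.set_detect_anomaly(True)

    out = model(src, src_pos)        # hypothetical model call
    loss = criterion(out, target)    # hypothetical loss
    loss.backward()                  # the anomaly trace now names the offending op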

Best.

K. Frank

Hi Frank,
thanks a lot. You were right: returning euclidean_distance.clone() fixes the problem, but I don’t understand why. After it is returned, it is only modified in a custom loss function (here it is ‘input’). The problem seems to be the sigmoid + clamping line, which is strange because I used this loss for normal attention and it worked fine.

class DistanceLoss(nn.Module):
    '''
    Focal loss for binary classification with or without logits. Targets that are neither exactly 0 nor 1 are ignored.
    Occurrences of the two classes are auto-balanced when using reduction='mean'
    '''
    def __init__(self, gamma=2, sigmoid_clamp: float = 1e-4, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.sigmoid_clamp = sigmoid_clamp
        self.reduction = reduction

    def forward(self, input, target):
        input_s = torch.clamp(input.sigmoid_(), min=self.sigmoid_clamp, max=1-self.sigmoid_clamp)
        pos_idx = (target == 1)
        neg_idx = (target == 0)
        pos_loss = torch.pow(1 - input_s[pos_idx], self.gamma) * torch.log(input_s[pos_idx])
        neg_loss = torch.pow(input_s[neg_idx], self.gamma)     * torch.log(1 - input_s[neg_idx])

        if self.reduction == 'mean':
            n_pos = pos_idx.sum()
            n_neg = neg_idx.sum()
            if n_pos == 0:
                return -neg_loss.sum()/n_neg
            else:
                return -pos_loss.sum()/n_pos - neg_loss.sum()/n_neg
        elif self.reduction == 'sum':
            return -pos_loss.sum() - neg_loss.sum()
        return -pos_loss, -neg_loss

Edit: changing sigmoid_() to sigmoid() fixes the problem without needing euclidean_distance.clone(). Still, why isn’t this a problem with normal attention? Another problem now is that the loss is a regular number, but some model parameters are set to NaN after the optimizer step. It is as if the gradient can’t reach them. Could this be related?
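
For reference, the out-of-place version of that line (the rest of DistanceLoss.forward() unchanged):

        # sigmoid() writes to a new tensor, so the encoder's sqrt() output is left untouched
        input_s = torch.clamp(input.sigmoid(), min=self.sigmoid_clamp, max=1 - self.sigmoid_clamp)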


Whether an inplace operation is disallowed (e.g. here the inplace sigmoid_ usage) depends on the operations used and on which tensors they need in their original form for the gradient computation. torch.sqrt uses its result for the backward pass, as seen here, so overwriting that result inplace breaks its gradient; your “normal attention” approach might use other operations that don’t depend on their result for the gradient calculation.
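
A minimal, self-contained repro of that interaction with toy tensors (independent of the model above):

    import torch

    x = torch.randn(4, requires_grad=True)
    y = torch.sqrt(x * x + 1e-8)   # sqrt saves its output for the backward pass
    y.sigmoid_()                   # inplace op bumps y's version counter
    y.sum().backward()             # RuntimeError: ... modified by an inplace operation ...

Replacing sigmoid_() with sigmoid() here lets the backward pass run, because the out-of-place call writes to a new tensor and leaves sqrt()'s saved output untouched.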