Gradcheck modifying tensors in-place?

I’m writing a unit test for a custom op. The op itself runs without issues; the failure only shows up when I run it through torch.autograd.gradcheck. Here’s the test function:

    x, x_len = range_tensor(100)
    y, y_len = range_tensor(100)
    x.requires_grad = True
    torch.autograd.gradcheck(self.custom_op, (x, y, x_len, y_len))

I get the following failure:

  File "/mnt/data/code/custom_op.py", line 242, in backward
    D, R, X_len, Y_len = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 102, 102]] is at version 2; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The backtrace points to my custom op (as expected). In the forward function, I have:

  ctx.save_for_backward(D, R, X_len, Y_len)

To make absolutely certain that I’m not accidentally making in-place modifications to any tensors, I’ve added .clone() to each of D, R, X_len, and Y_len. I’m still getting the same exception.
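
For reference, the cloning I tried was at the save call, i.e. something like:

    ctx.save_for_backward(D.clone(), R.clone(), X_len.clone(), Y_len.clone())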

Does anyone have ideas on what’s going wrong here?

Can you share the source code for self.custom_op?

Here’s the op. The bulk of the work is done by a Numba kernel that operates on raw CUDA memory.

import math

import torch
from numba import cuda
from torch.autograd import Function

# numba_forward, numba_backward (the CUDA kernels) and MAX_THREADS_PER_BLOCK
# are defined elsewhere in this module.

class CustomOp(Function):
    @staticmethod
    def forward(ctx, D, X_len, Y_len):
        dev = D.device
        dtype = D.dtype
        stream = torch.cuda.current_stream().cuda_stream

        B = D.shape[0]
        N = D.shape[1]
        M = D.shape[2]

        n = torch.max(X_len).cpu().item()
        m = torch.max(Y_len).cpu().item()
        threads_per_block = min(max(n, m), MAX_THREADS_PER_BLOCK)

        R = torch.empty((B, N + 2, M + 2), device=dev, dtype=dtype).fill_(math.inf)
        R[:, 0, 0] = 0

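        # Launch the forward Numba kernel: one block per batch element; it writes its results into R in place.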
        numba_forward[B, threads_per_block, stream](
            cuda.as_cuda_array(D.detach()),
            cuda.as_cuda_array(X_len),
            cuda.as_cuda_array(Y_len),
            cuda.as_cuda_array(R))

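        # Non-tensor values are stashed on ctx directly; tensors go through save_for_backward.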
        ctx.n = n
        ctx.m = m
        ctx.save_for_backward(D, R, X_len, Y_len)

        return R[:, -2, -2].clone()

    @staticmethod
    def backward(ctx, grad_output):
        dev = grad_output.device
        dtype = grad_output.dtype
        stream = torch.cuda.current_stream().cuda_stream
        D, R, X_len, Y_len = ctx.saved_tensors

        B = D.shape[0]
        N = D.shape[1]
        M = D.shape[2]

        n = ctx.n
        m = ctx.m
        threads_per_block = min(max(n, m), MAX_THREADS_PER_BLOCK)

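        # Embed D in a zero-padded (B, N + 2, M + 2) buffer so it is indexed the same way as R.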
        D_ = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)
        D_[:, 1:N + 1, 1:M + 1] = D

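        # Set the entries of R just past each sequence's actual length to -inf.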
        for i in range(B):
            R[i, X_len[i] + 1, :] = -math.inf
            R[i, :, Y_len[i] + 1] = -math.inf

        E = torch.zeros((B, N + 2, M + 2), dtype=dtype, device=dev)

        # Grid and block sizes are the same as for the forward() call above
        numba_backward[B, threads_per_block, stream](
            cuda.as_cuda_array(D_),
            cuda.as_cuda_array(R),
            cuda.as_cuda_array(X_len),
            cuda.as_cuda_array(Y_len),
            cuda.as_cuda_array(E))

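        # Drop the padding to recover the (B, N, M) gradient with respect to D.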
        E = E[:, 1:N + 1, 1:M + 1]

        return grad_output.view(-1, 1, 1).expand_as(E) * E, None, None
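
For anyone unfamiliar with the torch ↔ Numba interop used above: the pattern is just to hand a CUDA tensor's memory to a kernel through cuda.as_cuda_array, so the kernel reads and writes the tensor's storage directly. A minimal, self-contained sketch of that pattern (not the actual numba_forward / numba_backward kernels) looks like this:

    import torch
    from numba import cuda

    @cuda.jit
    def scale_kernel(x, factor):
        # one thread per element of a 1-D array
        i = cuda.grid(1)
        if i < x.shape[0]:
            x[i] *= factor

    t = torch.arange(8, dtype=torch.float32, device="cuda")
    # as_cuda_array wraps the tensor's device memory without copying,
    # so the kernel's writes are visible to torch afterwards
    scale_kernel[1, 8](cuda.as_cuda_array(t), 2.0)
    cuda.synchronize()
    print(t)  # tensor([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.], device='cuda:0')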