How to multiply a tensor with itself?

I am incorporating some custom functions into nanoGPT. Performing a linear combination works just fine:

import torch

def static_combination(batch):
  sequence_length, dimension = batch.size()
  for sequence in range(1, sequence_length):
    # running weighted average: mix each token with the result for the previous one
    scalar = 1.0/(sequence+1)
    batch[sequence] = torch.add(batch[sequence]*scalar, batch[sequence-1], alpha=(1-scalar))
  return batch
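For what it's worth, a quick standalone check (shapes invented here) confirms that backward runs through this version: `torch.add` and multiplication by a plain scalar don't need to save their tensor inputs for the backward pass, so the inplace writes into `batch` don't bother autograd:

```python
import torch

def static_combination(batch):
  sequence_length, dimension = batch.size()
  for sequence in range(1, sequence_length):
    scalar = 1.0/(sequence+1)
    batch[sequence] = torch.add(batch[sequence]*scalar, batch[sequence-1], alpha=(1-scalar))
  return batch

x = torch.randn(4, 3, requires_grad=True)   # invented shape
out = static_combination(x.clone())         # clone: a leaf tensor can't be written inplace
out.sum().backward()                        # completes without a RuntimeError
print(x.grad.shape)
```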

Changing the operation to a multiplication yields a RuntimeError:

import torch

# multiply each token with the previous one
def token_multiplication(batch):
  sequence_length, token_dimension = batch.size()
  for token_number in range(1, sequence_length):
    batch[token_number] = torch.mul(batch[token_number], batch[token_number-1])
  return batch
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 128]], which is output 0 of AsStridedBackward0, is at version 63; expected version 62 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

With torch.autograd.set_detect_anomaly(True):

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 128]], which is output 0 of AsStridedBackward0, is at version 63; expected version 62 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later.

As far as I know retain_graph=True isn’t used.
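The error is reproducible outside nanoGPT; a minimal standalone script (shapes invented) triggers the same message:

```python
import torch

def token_multiplication(batch):
  sequence_length, token_dimension = batch.size()
  for token_number in range(1, sequence_length):
    batch[token_number] = torch.mul(batch[token_number], batch[token_number-1])
  return batch

x = torch.randn(4, 3, requires_grad=True)   # invented shape
out = token_multiplication(x.clone())       # clone: a leaf tensor can't be written inplace
try:
  out.sum().backward()
  error_message = None
except RuntimeError as e:
  error_message = str(e)
print(error_message)
```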
The environment setting is:

  • transformers version: 4.40.2
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.3 (cpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Detaching and cloning the previous tensor seems to solve the issue:

def token_multiplication(batch):
  sequence_length, token_dimension = batch.size()
  for token_number in range(1, sequence_length):
    # detach and clone the previous token so backward no longer tracks it
    batch[token_number] = torch.mul(batch[token_number], batch[token_number-1].detach().clone())
  return batch

Is this the correct way to handle backpropagation?

Hi Wags!

You almost certainly don’t want to do this.

Try clone()ing batch[token_number] (without a .detach()), rather than
.detach()ing batch[token_number - 1].

My take on what is going on:

An inplace-modification error happens when two things occur: first, a tensor from
the forward pass is needed to compute gradients during the backward pass; and
second, that tensor is modified inplace. Note that assigning into a tensor using
indexing or slicing counts as an inplace modification.
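You can watch this happen via the (private) `_version` counter that autograd uses to detect inplace modifications; assigning into a tensor by indexing bumps it just like an explicit inplace op such as add_() does:

```python
import torch

t = torch.zeros(3, 4)
v0 = t._version    # autograd's per-tensor modification counter
t[1] = 1.0         # assigning into t by indexing ...
v1 = t._version    # ... bumps the counter, i.e., it counts as an inplace modification
t.add_(1.0)        # an explicit inplace op bumps it too
v2 = t._version
print(v0, v1, v2)
```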

With your original:

batch[token_number] = torch.mul(batch[token_number], batch[token_number-1])

in order to compute the gradient with respect to batch[token_number - 1],
.backward() needs to know what batch[token_number - 1] was multiplied by,
namely batch[token_number], which, as a view, shares its storage (and version
counter) with the whole batch tensor. But you modify batch inplace, hence the error.

When you .detach() batch[token_number - 1], .backward() no longer tries to
compute this part of the gradient, so batch (in particular, batch[token_number])
is no longer needed and it doesn’t matter that it was modified inplace, so the error
goes away.

So, is this the correct way to handle backpropagation? No, because you are no
longer computing the piece of the gradient that comes from
batch[token_number - 1]. You probably do want this part of the gradient,
so your “fix” is probably wrong.

By .clone()ing batch[token_number] (before overwriting it by modifying
batch inplace), you let the computation graph hold on to (the unmodified)
batch[token_number], which is then used by .backward() to compute the
gradient.
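(As an aside, not from your code: a further alternative, sketched here with invented shapes, is to avoid the inplace write altogether by collecting the rows in a python list and stack()ing them at the end. Nothing saved for the backward pass is ever overwritten, and the full gradient is preserved:)

```python
import torch

def token_multiplication(batch):
  # build each row out of place instead of writing back into batch
  rows = [batch[0]]
  for token_number in range(1, batch.size(0)):
    rows.append(batch[token_number] * rows[-1])
  return torch.stack(rows)

x = torch.randn(4, 3, requires_grad=True)   # invented shape
out = token_multiplication(x)
out.sum().backward()                        # runs without the inplace error
print(x.grad.shape)
```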

Best.

K. Frank