Hi Adele!
This is telling you that `fc2.weight` is the tensor that is being modified inplace. Note that its `._version` has jumped to 6251, indicating that it's been modified inplace 6250 times before the call to `.backward()` that is triggering the error.
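As a small illustration (a hypothetical sketch, not your code): every inplace operation on a tensor bumps its `._version` counter, and that counter is what autograd checks at backward time:

```python
import torch

t = torch.ones(3)
v0 = t._version            # version counter before any inplace ops
t.add_(1.0)                # inplace add bumps the counter
t.mul_(2.0)                # so does inplace multiply
print(t._version - v0)     # 2 -- one bump per inplace operation
```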
You need to think through your training algorithm. As we'll see below, you compute a loss, call a training loop (that presumably calls an `opt.step()` multiple times), and then call `loss.backward()`. `opt.step()` modifies inplace the parameters that it is optimizing, causing the error.
Does it really make sense for your algorithm to stick a training loop in between the computation of `loss` and the call to `loss.backward()`?
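To make this concrete, here is a minimal, hypothetical sketch of that pattern (the layer sizes, data, and optimizer are made up, but the structure matches): the `loss` is computed first, a training loop then steps the optimizer, and the late `loss.backward()` fails because the saved weight no longer matches the version autograd recorded:

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),   # plays the role of fc2
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x = torch.randn(8, 4)

loss = net(x).sum()            # autograd saves fc2's weight here

for _ in range(3):             # "training loop" run before backward
    opt.zero_grad()
    net(x).sum().backward()
    opt.step()                 # modifies the parameters inplace

caught = None
try:
    loss.backward()            # saved weight has been modified
except RuntimeError as err:
    caught = err
print(caught is not None)      # True -- the inplace-modification error
```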
The inplace operation is presumably a call to `opt.step()` that isn't shown in the code you've posted, but is likely buried somewhere in:

`trainer.run(train_dataloader, max_epochs=params.num_epochs)`
This creates `fc2` with a `weight` of shape `[1, 128]`. The inplace-modification error reports the transpose of this shape, namely, `[128, 1]`.
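You can check that shape relationship directly (a hypothetical sketch; the forward pass consumes the transposed weight, which is why the error message reports the transposed shape):

```python
import torch

fc2 = torch.nn.Linear(128, 1)
print(list(fc2.weight.shape))      # [1, 128] -- the stored parameter
# autograd saved the transposed weight that the forward pass consumes,
# so the error message reports [128, 1]
print(list(fc2.weight.t().shape))  # [128, 1]
```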
Note that this comment in the code you posted warns that some “network
parameters will be changed in place,” tipping you off to the possibility of an
inplace-modification error.
We noted above that the error message tells you that `fc2.weight` was modified inplace 6250 times. This would make sense if your inner training loop calls some `opt.step()` 6250 times as it iterates.
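You can verify that step-to-version relationship with a hypothetical sketch using plain SGD (no momentum or weight decay), which performs one inplace update per parameter per `step()`:

```python
import torch

lin = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)
v0 = lin.weight._version           # baseline (init itself uses inplace ops)

for _ in range(5):                 # five optimizer steps
    opt.zero_grad()
    lin(torch.randn(3, 2)).sum().backward()
    opt.step()                     # one inplace update of lin.weight

print(lin.weight._version - v0)    # 5 -- one version bump per step
```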
For some examples that illustrate how to debug inplace-modification errors,
see this post:
Good luck!
K. Frank