In-place operation at training time

The situation is as follows:
I have 3 blocks of NNs: A, B, and C, where B is an nn.Embedding.
I want A’s output to replace only one embedding row in B, for instance: B[idx] = A_output,
and I want only A to train (B’s and C’s parameters stay frozen):

for step, batch in enumerate(train_dataloader):
    A_output = A(batch['input_1'])
    B.weight.data[idx] = torch.zeros(A_output.shape)
    B.weight[idx] = B.weight[idx] + A_output
    B_output = B(batch['input_2'])
    C_output = C(B_output)
    loss = F.mse_loss(C_output, batch['target'], reduction="mean")
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()

I get the following error:

    one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512, 768]], which is output 0 of AsStridedBackward0, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Any ideas?

Thanks

These lines of code are most likely causing the issue:

    B.weight.data[idx] = torch.zeros(A_output.shape)
    B.weight[idx] = B.weight[idx] + A_output

since you are directly manipulating the internal .data attribute (which is deprecated, as it can yield unwanted side effects) and modifying the parameter in-place.
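As a side note: if you did not need gradients to flow through the assignment at all, the recommended replacement for the .data manipulation would be a no_grad guard. A minimal sketch with hypothetical shapes (note that this would not let A train, so it does not solve your use case by itself):

    import torch
    import torch.nn as nn

    B = nn.Embedding(512, 768)   # hypothetical shapes for illustration
    idx = 0
    new_row = torch.randn(768)

    # Sanctioned way to overwrite a parameter row when no gradient
    # should be tracked for the assignment itself:
    with torch.no_grad():
        B.weight[idx] = new_row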
I don’t fully understand your use case, since A_output would have a shape of [batch_size, *] (where * denotes additional dimensions), while the embedding matrix in B will have the shape [num_embeddings, embedding_dim] and thus has no dependency on the batch size.
Are you forcing the training to use a batch size of 1, or how do you otherwise guarantee that the assignment works? One out-of-place approach is sketched below.
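One way to achieve the desired behavior without any in-place writes would be to patch the weight matrix out of place and pass it to the functional F.embedding call, so that gradients can flow back into A while B stays frozen. Here is a minimal sketch with placeholder modules and shapes (it assumes A’s output can be reduced to a single [embedding_dim] vector, e.g. by using a batch size of 1):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_embeddings, embedding_dim = 512, 768   # matching the shapes in your error
    idx = 3                                    # placeholder row index

    A = nn.Linear(10, embedding_dim)           # stand-in for your block A
    B = nn.Embedding(num_embeddings, embedding_dim)
    B.weight.requires_grad_(False)             # freeze B

    x = torch.randn(1, 10)                     # stand-in for batch['input_1']
    tokens = torch.randint(0, num_embeddings, (1, 5))  # stand-in for batch['input_2']

    row = A(x).squeeze(0)                      # [embedding_dim]

    # Build a patched weight matrix out of place: mask out the target row
    # and add the new one, so autograd never sees an in-place modification.
    mask = torch.zeros(num_embeddings, 1)      # plain buffer, no grad needed
    mask[idx] = 1.0
    patched_weight = B.weight * (1.0 - mask) + row.unsqueeze(0) * mask

    # Look up the embeddings with the patched matrix; gradients will only
    # flow into A, since B.weight does not require gradients.
    B_output = F.embedding(tokens, patched_weight)
    B_output.sum().backward()
    print(A.weight.grad is not None)   # True
    print(B.weight.grad)               # None, as B stays frozen

You could then feed B_output to C as before; since B’s and C’s parameters are frozen, only A would receive gradient updates, and the graph is rebuilt in every iteration, so retain_graph=True should no longer be needed.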