In-place operation at training time

The situation is as follows:
I have 3 blocks of NNs: A, B, and C, where B is an nn.Embedding.
I want A’s output to replace only one embedding row in B, for instance: B[idx] = A_output,
and I want only A to train (B’s and C’s parameters stay frozen):

for step, batch in enumerate(train_dataloader):
    A_output = A(batch['input_1'])
    B.weight.data[idx] = torch.zeros(A_output.shape)
    B.weight[idx] = B.weight[idx] + A_output
    B_output = B(batch['input_2'])
    C_output = C(B_output)
    loss = F.mse_loss(C_output, batch['target'], reduction="mean")
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()

I get the following error:

    one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512, 768]], which is output 0 of AsStridedBackward0, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Any ideas?

Thanks

These lines of code are most likely causing the issue:

    B.weight.data[idx] = torch.zeros(A_output.shape)
    B.weight[idx] = B.weight[idx] + A_output

since you are directly manipulating the internal .data attribute (which is deprecated, as it can yield unwanted side effects) and modifying the parameter in-place.
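As a side note: if you did not need gradients to flow through the assignment at all, the recommended replacement for the .data manipulation would be a no_grad guard. A minimal sketch with hypothetical shapes (note that this would not let A train, so it does not solve your use case by itself):

    import torch
    import torch.nn as nn

    B = nn.Embedding(512, 768)   # hypothetical shapes for illustration
    idx = 0
    new_row = torch.randn(768)

    # Sanctioned way to overwrite a parameter row when no gradient
    # should be tracked for the assignment itself:
    with torch.no_grad():
        B.weight[idx] = new_row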
I don’t fully understand your use case, since A_output would have a shape of [batch_size, *] (where * denotes additional dimensions), while the embedding matrix in B will have the shape [num_embeddings, embedding_dim] and thus has no dependency on the batch size.
Are you forcing the training to use a batch size of 1, or how do you otherwise guarantee that the assignment works? One out-of-place approach is sketched below.
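One way to achieve the desired behavior without any in-place writes would be to patch the weight matrix out of place and pass it to the functional F.embedding call, so that gradients can flow back into A while B stays frozen. Here is a minimal sketch with placeholder modules and shapes (it assumes A’s output can be reduced to a single [embedding_dim] vector, e.g. by using a batch size of 1):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_embeddings, embedding_dim = 512, 768   # matching the shapes in your error
    idx = 3                                    # placeholder row index

    A = nn.Linear(10, embedding_dim)           # stand-in for your block A
    B = nn.Embedding(num_embeddings, embedding_dim)
    B.weight.requires_grad_(False)             # freeze B

    x = torch.randn(1, 10)                     # stand-in for batch['input_1']
    tokens = torch.randint(0, num_embeddings, (1, 5))  # stand-in for batch['input_2']

    row = A(x).squeeze(0)                      # [embedding_dim]

    # Build a patched weight matrix out of place: mask out the target row
    # and add the new one, so autograd never sees an in-place modification.
    mask = torch.zeros(num_embeddings, 1)      # plain buffer, no grad needed
    mask[idx] = 1.0
    patched_weight = B.weight * (1.0 - mask) + row.unsqueeze(0) * mask

    # Look up the embeddings with the patched matrix; gradients will only
    # flow into A, since B.weight does not require gradients.
    B_output = F.embedding(tokens, patched_weight)
    B_output.sum().backward()
    print(A.weight.grad is not None)   # True
    print(B.weight.grad)               # None, as B stays frozen

You could then feed B_output to C as before; since B’s and C’s parameters are frozen, only A would receive gradient updates, and the graph is rebuilt in every iteration, so retain_graph=True should no longer be needed.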