RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Hi, I am stuck with an error and I do not know what to look for. I have seen previous posts with a similar error, but their fixes did not apply to my case.

The error summary:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace
operation: [torch.cuda.HalfTensor [64, 32]], which is output 0 of TBackward, is at version 2; expected
version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its
gradient. The variable in question was changed in there or anywhere later. Good luck!
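For reference, here is a standalone toy snippet (not from my code, just to illustrate the failure mode as I understand it): a tensor that autograd saved during the forward pass gets modified in place before backward is called.

import torch
import torch.nn as nn

fc = nn.Linear(4, 2)
x = torch.randn(3, 4)
out = fc(x)            # AddmmBackward saves x to compute the weight gradient
x.add_(1.0)            # in-place edit bumps x's version counter
out.sum().backward()   # RuntimeError: ... modified by an inplace operation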

Full error:

[W python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/train.py", line 123, in <module>
    train.fit(n_epochs=n_epochs, vb_size=vb_size, train_dataset=train_dataset,
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/framework.py", line 111, in fit
    for i, data in (enumerate(tqdm(self.train_loader)) if self.verbose >= 2 else enumerate(self.train_loader)):
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/dataset.py", line 81, in __getitem__
    return self.load_file(scene_idx)
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/dataset.py", line 118, in load_file
    mesh.fusion(features[i], view[:,:-1], view[:,-1].long(), depth[i])
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/mesh/mesh.py", line 150, in fusion
    self.fc_fusion(index, latent_features_flat)
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/mesh/mesh.py", line 309, in fc_fusion
    self.feature_volume[inds_0, inds_1, inds_2, :] = self.fusion_learner(torch.cat(inputs_i), self.feature_volume[inds_0, inds_1, inds_2, :])
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/learned_fusion.py", line 13, in forward
    return self.fc(a)
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
 (function print_stack)
  0%| | 0/600 [00:22<?, ?it/s]
Traceback (most recent call last):
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/train.py", line 123, in <module>
    train.fit(n_epochs=n_epochs, vb_size=vb_size, train_dataset=train_dataset,
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/framework.py", line 140, in fit
    loss = self.train_step(inputs[idx], torch.tensor(targets[idx], device=self.device, dtype=torch.long), optimizer, last=last
    )
  File "/cluster/work/riner/patakiz/asl/asldoc-2021-SA-Zador-How-to-fuse/framework.py", line 213, in train_step
    loss.backward(retain_graph=retain_graph
    )
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cluster/home/patakiz/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(

I am using two neural networks, self.net and self.fusion_learner. They are stored in my Framework class, which handles most things. Additionally, my dataset stores a copy of self.fusion_learner, because I nest the fusion-learner network inside the DataLoader: part of my data preprocessing is learned.

self.net:

class Task(nn.Module):
    def __init__(self, fusion=None, n_mult=None, kmeans_strat=None):
        super(Task, self).__init__()
        mult = 1

        self.conv1 =  nn.Conv3d(32*mult, 16, kernel_size=[3,3,3], stride=2, padding=0, bias=False)
        self.bnorm1 = nn.BatchNorm3d(16)
        self.relu = nn.ReLU(False)
        self.conv2 = nn.Conv3d(16, 16, kernel_size=1, stride=1, padding=0, bias=False)
        self.dropout = nn.Dropout3d(0.2)

        self.conv3 = nn.Conv3d(16, 8, kernel_size=[3,3,3], stride=2, padding=0, bias=False)
        self.bnorm2 = nn.BatchNorm3d(8)
        self.conv4 = nn.Conv3d(8, 8, kernel_size=1, stride=1, padding=0)
        self.conv5 = nn.Conv3d(8, 8, kernel_size=1, stride=1, padding=0)

        self.convtranspose5 = nn.ConvTranspose3d(8, 8, kernel_size=1, stride=1, padding=0)
        self.convtranspose4 = nn.ConvTranspose3d(8, 8, kernel_size=1, stride=1, padding=0)
        self.convtranspose3 = nn.ConvTranspose3d(8, 16, kernel_size=[3,3,3], stride=2, padding=0)
        self.convtranspose2 = nn.ConvTranspose3d(16, 16, kernel_size=1, stride=1, padding=0)
        self.convtranspose1 = nn.ConvTranspose3d(16, 32, kernel_size=[3,3,3], stride=2, padding=0)
        self.convtranspose_out = nn.Conv3d(32, 14, 1, 1, 0)

    def forward(self, feature_volume):
        x = self.relu(self.bnorm1(self.conv1(feature_volume)))
        x = self.relu(self.bnorm1(self.conv2(x)))
        x = self.relu(self.bnorm1(self.conv2(x)))
        x = self.relu(self.bnorm2(self.conv3(x)))
        x = self.relu(self.bnorm2(self.conv4(x)))
        x = self.relu(self.bnorm2(self.conv5(x)))
        x = self.dropout(x)

        x = self.relu(self.convtranspose5(x))
        x = self.relu(self.convtranspose4(x))
        x = self.relu(self.convtranspose3(x))
        x = self.relu(self.convtranspose2(x))
        x = self.relu(self.convtranspose2(x))
        x = self.relu(self.convtranspose1(x))
        x = self.convtranspose_out(x)
        return x

self.fusion_learner:

class FC_fusion(nn.Module):
    def __init__(self):
        super(FC_fusion, self).__init__()

        self.fc = nn.Linear(64, 32)

    def forward(self, X_new, X_fused):
        if X_fused.dim()==1:
            X_fused = X_fused[None,:]
        a = torch.cat([X_new, X_fused], dim=1)
        return self.fc(a)
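
For shape context (assuming 32-dimensional features on each side, which matches the 64 -> 32 Linear and the width of my feature volume), a call looks like this:

import torch

fusion = FC_fusion()
X_new = torch.randn(5, 32)      # 5 new feature vectors
X_fused = torch.randn(5, 32)    # 5 previously fused vectors from the feature volume
out = fusion(X_new, X_fused)
print(out.shape)                # torch.Size([5, 32])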

The forward pass of self.net is not interesting: just self.net(input). I pass data through self.fusion_learner many times:

...
for i in range(np.max(batches_len)):
    inputs_i = []
    sorted_indices_i = []
    for j in range(len(batches_len)):
        if i < batches_len[j]:
            inputs_i.append(packed_inputs_list[j][bs_sum[j]:batch_sizes_list[j][i]])
            bs_sum[j]+=batch_sizes_list[j][i]
            sorted_indices_i.append(sorted_indices_list[j][:inputs_i[-1].shape[0]])
    sorted_indices_i = torch.cat(sorted_indices_i)
    inds_0, inds_1, inds_2 = indices[0][sorted_indices_i], indices[1][sorted_indices_i], indices[2][sorted_indices_i]

    self.feature_volume[inds_0, inds_1, inds_2, :] = self.fusion_learner(torch.cat(inputs_i), self.feature_volume[inds_0, inds_1, inds_2, :])

In fact, I run this loop multiple times before any backpropagation. Here the output of the fusion learner updates the feature volume, which is then used again as one of its inputs together with a new feature vector. As a reminder, all of this happens inside the DataLoader, i.e. inside the dataset's __getitem__ function.

Here is how I define my optimizer:

optimizer = Adam(list(self.net.parameters()) + list(self.fusion_learner.parameters()), eps=1e-04)

and here is how I do the optimization step:

optimizer.zero_grad()
pred_flat, target_flat = self.net_step(inputs, classes)
loss = self.crit(pred_flat, target_flat)
if last == False:
    retain_graph_=True
else:
    retain_graph_=False
loss.backward(retain_graph=retain_graph_)
optimizer.step()

Of course, throughout this process there are many more lines of code (self.net_step calls self.net(), for example), but I can't include everything.

Does anybody have any ideas?

Based on the stack trace, I would guess that this in-place manipulation is what is disallowed:

self.feature_volume[inds_0, inds_1, inds_2, :] = self.fusion_learner(torch.cat(inputs_i), self.feature_volume[inds_0, inds_1, inds_2, :])

so try to avoid it by assigning the result to a new tensor instead.
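
A rough sketch of what I mean (assuming feature_volume has shape [D0, D1, D2, 32] and the indices are exactly as in your snippet): Tensor.index_put, the out-of-place counterpart of index_put_, returns a new tensor with the selected rows replaced, so the tensor that autograd saved is left untouched.

new_vals = self.fusion_learner(
    torch.cat(inputs_i),
    self.feature_volume[inds_0, inds_1, inds_2, :],
)
# index_put (no trailing underscore) is out-of-place: it returns a copy with
# the indexed rows replaced instead of mutating feature_volume in place.
self.feature_volume = self.feature_volume.index_put(
    (inds_0, inds_1, inds_2), new_vals
)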