Inplace error in PyTorch LSTM

term1112 · December 13, 2021, 5:30am

I am using an LSTM to summarize a trajectory as shown below. The shapes of features and rewards are [50, 1, 2048] and [50, 1, 1] respectively, i.e. the sequence length is 50 and the batch size is 1.

import torch
import torch.nn as nn


class RolloutEncoder(nn.Module):
    def __init__(self, config):
        super(RolloutEncoder, self).__init__()
        self._input_size = (
            2048 + 1
        )  # deter_state + imag_reward; fix and use config["deter_dim"] + 1
        self._hidden_size = config["rollout_enc_size"]
        self._lstm = nn.LSTM(self._input_size, self._hidden_size, bias=True)

    def forward(self, traj):
        features = traj["features_pred"]  # [seq_len, batch, 2048]
        rewards = traj["reward_pred"].unsqueeze(1)  # [seq_len, batch, 1]
        input = torch.cat((features, rewards), dim=2)  # [seq_len, batch, 2049]
        encoding, (h_n, c_n) = self._lstm(input)
        code = h_n.squeeze(0)  # final hidden state, [batch, hidden_size]
        return code
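
For reference, the shapes involved can be checked with dummy tensors. This is only a sketch: the hidden size of 4 is an assumption (chosen because it would give a [16, 2049] input weight), and the pre-unsqueeze shape of reward_pred is assumed to be [50, 1].

```python
import torch
import torch.nn as nn

# Assumed sizes: hidden=4 is a guess, the rest follows the shapes stated in the post.
seq_len, batch, feat_dim, hidden = 50, 1, 2048, 4

features = torch.randn(seq_len, batch, feat_dim)   # [50, 1, 2048]
reward_pred = torch.randn(seq_len, batch)           # assumed shape before unsqueeze
rewards = reward_pred.unsqueeze(1)                  # [50, 1, 1] (matches only because batch == 1)

lstm_input = torch.cat((features, rewards), dim=2)  # [50, 1, 2049]
lstm = nn.LSTM(feat_dim + 1, hidden, bias=True)

encoding, (h_n, c_n) = lstm(lstm_input)
code = h_n.squeeze(0)                               # [1, 4]: final hidden state per batch element

print(lstm_input.shape, encoding.shape, code.shape)
```

With a batch size larger than 1, unsqueeze(1) on a [seq_len, batch] tensor would give [seq_len, 1, batch], so unsqueeze(-1) may be the safer choice if that is the real layout.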

My training loop goes something like this:

encoder = RolloutEncoder(config)
for e in range(episodes):
    for step in range(steps):
        print(f"Step {step}")
        # calc traj
        code = encoder(traj)
        # some operations that do not modify code but only concat it with some other tensor
        # calc loss
        opt.zero_grad()
        loss.backward()
        opt.step()

On running, I get this error:

Step 0
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9
Step 10
Step 11
Step 12
Step 13
Step 14
Traceback (most recent call last):
  File "/path/main.py", line 351, in <module>
    agent_loss.backward()
  File "/home/.conda/envs/abc/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/user/.conda/envs/abc/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 2049]] is at version 8; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

After enabling anomaly detection with torch.autograd.set_detect_anomaly(True), the error points to this line in the encoder definition:

encoding, (h_n, c_n) = self._lstm(input)
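
For anyone reproducing this, anomaly detection can also be scoped with a context manager instead of the global switch. The sketch below uses a small stand-in LSTM with made-up sizes, not the encoder above. Anomaly mode has to be active during the forward pass, because that is when the traceback of each operation is recorded.

```python
import torch
import torch.nn as nn

# Small stand-in model; the sizes here are arbitrary and only for illustration.
lstm = nn.LSTM(input_size=3, hidden_size=4)
x = torch.randn(5, 1, 3)

# Wrap both forward and backward: the forward trace is what gets reported
# when a backward function later fails.
with torch.autograd.detect_anomaly():
    _, (h_n, _) = lstm(x)
    loss = h_n.sum()
    loss.backward()

grad_ready = lstm.weight_hh_l0.grad is not None
print(grad_ready)
```

Anomaly mode slows training noticeably, so it is usually switched off again once the failing operation has been located.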

I know this is a very common error, but I am not using any in-place operation in my code, and none of the other discussions on this topic resolved it. Moreover, the error only occurs after several steps have run successfully, which is really weird. On inspecting, I found that the [16, 2049] tensor is one of the weights of the LSTM. I also tried using dummy random tensors in place of features and rewards, but the error persists, suggesting that the traj calculation has nothing to do with this error.
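
One scenario that would be consistent with "is at version 8; expected version 1" is sketched below. It is an assumption, not a confirmed diagnosis of the code above: opt.step() updates the parameters in place, so calling backward on a graph that was built before one or more optimizer steps fails with the same message, even with no explicit in-place operation in user code. The hidden size of 4 and the SGD optimizer are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=2049, hidden_size=4)  # hidden_size=4 is assumed
opt = torch.optim.SGD(lstm.parameters(), lr=0.01)  # placeholder optimizer

# Map parameter names to shapes; weight_ih_l0 has 4 * hidden_size rows (one block per gate).
shapes = {name: tuple(p.shape) for name, p in lstm.named_parameters()}
print(shapes["weight_ih_l0"])  # (16, 2049)

x = torch.randn(50, 1, 2049)

# Build a graph BEFORE the update and keep a reference to it.
_, (h_old, _) = lstm(x)
stale_loss = h_old.sum()

# A normal training step; the optimizer writes to the weights in place.
w = lstm.weight_ih_l0
version_before = w._version
_, (h_new, _) = lstm(x)
opt.zero_grad()
h_new.sum().backward()
opt.step()
version_after = w._version  # bumped by the in-place parameter update

# Backward through the old graph now sees weights at a newer version.
try:
    stale_loss.backward()
    stale_error = None
except RuntimeError as err:
    stale_error = str(err)

print(version_before, version_after)
print(stale_error)
```

If something like this is the cause, the thing to look for is any tensor that carries a graph from an earlier step into a later loss (for example a cached hidden state, a stored trajectory, or a running sum of losses), and whether it can be detached or recomputed after the update.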

Why might this be happening and how can I solve this?