RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [256, 10, 128]] is

I'm working on a seq2seq project, but when I run it, the error above occurs. I have read the similar topics, but I still don't know how to fix it. I've been stuck here for days and hope someone can help me. Thanks a lot!
The full error message is down below:

Traceback (most recent call last):
  File "C:/Users/cqf/Desktop/试验/", line 172, in <module>
  File "C:/Users/cqf/Desktop/试验/", line 158, in run
  File "C:\Users\cqf\Desktop\试验\", line 85, in train_epoch
  File "C:\Users\cqf\Desktop\试验\", line 154, in train_batch
  File "F:\Anaconda3\envs\Mypycharm\lib\site-packages\torch\", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "F:\Anaconda3\envs\Mypycharm\lib\site-packages\torch\autograd\", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [256, 10, 128]] is at version 47; expected version 46 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

As you can see, I have already enabled anomaly detection mode, but I'm still confused.
My code is down below:

def train_batch(
    x, bl_val = baseline.unwrap_batch(batch)
    x = move_to(x, opts.device)
    bl_val = move_to(bl_val, opts.device) if bl_val is not None else None

    # Evaluate model, get costs and log probabilities
    cost, log_likelihood = model(x)

    # Evaluate baseline, get baseline loss if any (only for critic)
    bl_val, bl_loss = baseline.eval(x, cost) if bl_val is None else (bl_val, 0)

    # Calculate loss
    reinforce_loss = ((cost - bl_val) * log_likelihood).mean()
    loss = reinforce_loss + bl_loss

    # Perform backward pass and optimization step
    with torch.autograd.set_detect_anomaly(True):
        loss.backward()  # the backward call that the traceback reports inside train_batch

    # Clip gradient norms and get (clipped) gradient norms for logging
    grad_norms = clip_grad_norms(optimizer.param_groups, opts.max_grad_norm)
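
A note on anomaly mode: it records the forward-pass stack trace at the moment each operation runs, so enabling it only around the backward call leaves no forward trace to print for the failing op. Below is a minimal sketch, assuming a toy graph in place of the real model (the tensors and the function name are made up for illustration), with the forward pass placed inside the context as well:

```python
import torch


def run_with_anomaly_detection():
    # Toy stand-in for the model; not the network from the post.
    w = torch.ones(3, requires_grad=True)
    message = None
    # Wrap BOTH the forward and the backward pass so forward traces are recorded.
    with torch.autograd.set_detect_anomaly(True):
        hidden = torch.sigmoid(w * 2)  # sigmoid saves its output for backward
        hidden.add_(1)                 # in-place edit invalidates that saved output
        try:
            hidden.sum().backward()
        except RuntimeError as err:
            message = str(err)
    return message


message = run_with_anomaly_detection()
print(message)
```

With the forward pass inside the context, the warning printed before the error should point at the line that created the tensor which was later modified.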

The forward function is down below:

 def forward(self, input, return_pi=False):
        """
        :param input: (batch_size, graph_size, node_dim) input node features or dictionary with multiple tensors
        :param return_pi: whether to return the output sequences, this is optional as it is not compatible with
        using DataParallel as the results may be of different lengths on different GPUs
        """

        if self.checkpoint_encoder and  # Only checkpoint if we need gradients
            embeddings, _ = checkpoint(self.embedder, self._init_embed(input))
        else:
            embeddings, _ = self.embedder(self._init_embed(input))

        _log_p, pi, cost = self._inner(input, embeddings)

        init_lengths, mask = self.problem.get_costs(input, pi)
        final_lengths = cost + init_lengths[:,None]

        # Log likelihood is calculated within the model since returning it per action does not work well with
        # DataParallel since sequences can be of different lengths
        ll = self._calc_log_likelihood(_log_p, pi, mask)
        if return_pi:
            return final_lengths.squeeze(), ll, pi

        return final_lengths.squeeze(), ll

The _inner function, which performs the decoding, is below:

 def _inner(self, input, embeddings):

        outputs = []
        sequences = []

        state = self.problem.make_state(input)

        # Compute keys, values for the glimpse and keys for the logits once as they can be reused in every step
        fixed = self._precompute(embeddings)

        batch_size = state.ids.size(0)

        # Perform decoding steps
        i = 0
        while not (self.shrink_size is None and state.all_finished()):

            if self.shrink_size is not None:
                unfinished = torch.nonzero(state.get_finished() == 0)
                if len(unfinished) == 0:
                    break
                unfinished = unfinished[:, 0]
                # Check if we can shrink by at least shrink_size and if this leaves at least 16
                # (otherwise batch norm will not work well and it is inefficient anyway)
                if 16 <= len(unfinished) <= state.ids.size(0) - self.shrink_size:
                    # Filter states
                    state = state[unfinished]
                    fixed = fixed[unfinished]

            log_p, mask = self._get_log_p(fixed, state)

            # Select the indices of the next nodes in the sequences, result (batch_size) long
            selected = self._select_node(log_p.exp()[:, 0, :], mask[:, 0, :])  # Squeeze out steps dimension

            state = state.update(selected, i)

            # Now make log_p, selected desired output size by 'unshrinking'
            if self.shrink_size is not None and state.ids.size(0) < batch_size:
                log_p_, selected_ = log_p, selected
                log_p = log_p_.new_zeros(batch_size, *log_p_.size()[1:])
                selected = selected_.new_zeros(batch_size)

                log_p[state.ids[:, 0]] = log_p_
                selected[state.ids[:, 0]] = selected_

            # Collect output of step
            outputs.append(log_p[:, 0, :])
            sequences.append(selected)

            i = i+1
        lengths = state.lengths + state.get_final_cost()

        # Collected lists, return Tensor
        return torch.stack(outputs, 1), torch.stack(sequences, 1), lengths

P.S. The project is fairly large, so I don't know how to simplify it. If you need the whole code, please contact me!
If you can't see where the problem is, telling me how to find the offending variable would also help. There is one more thing that confuses me: what do "version 47" and "version 46" in the error mean?
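
On the version question: every tensor carries a version counter that is incremented by each in-place operation. When autograd saves a tensor for the backward pass it remembers the version at that moment, and during backward it compares it with the current one. "At version 47; expected version 46" therefore means the tensor was modified in place once more after it was saved. A minimal sketch of this, assuming a toy graph (the names are illustrative, and `_version` is an internal attribute used here only for inspection):

```python
import torch

w = torch.ones(3, requires_grad=True)
b = (w * 2).exp()                # exp keeps its output b for the backward pass
version_when_saved = b._version  # version recorded by autograd at save time
b.add_(1)                        # every in-place op bumps the version counter
version_at_backward = b._version

try:
    b.sum().backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(version_when_saved, version_at_backward)
print(error_message)
```

Replacing `b.add_(1)` with the out-of-place `b = b + 1` leaves the saved tensor untouched, and the backward pass then succeeds.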

These errors are often raised by using retain_graph = True as a workaround for another issue. Could you explain why you are using it? If you are not sure and added it to avoid the “trying to backpropagate a second time…” error, check if you have forgotten to detach the computation graph to avoid trying to recompute gradients from previous iterations.
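
To illustrate the detach suggestion, here is a minimal sketch, assuming a small RNN whose hidden state is carried from one iteration to the next (the model, sizes, and loss are made up for illustration and are not taken from the project above). Detaching the carried state after each step lets a plain `backward()` work without `retain_graph=True`:

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.1)

hidden = torch.zeros(1, 2, 8)      # (num_layers, batch, hidden_size)
steps_done = 0
for _ in range(3):
    x = torch.randn(2, 5, 4)       # (batch, seq_len, input_size)
    out, hidden = rnn(x, hidden)
    loss = out.pow(2).mean()

    optimizer.zero_grad()
    loss.backward()                # no retain_graph needed
    optimizer.step()

    # Cut the graph here so the next iteration does not backpropagate into this one.
    hidden = hidden.detach()
    steps_done += 1

print(steps_done)
```

Without the `detach()` call, the second iteration would try to backpropagate through the first iteration's graph. Adding `retain_graph=True` at that point tends to trade that error for the in-place one, because `optimizer.step()` has already modified the parameters the old graph saved.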

Thanks for your reply! I'm using it to try to fix this same error, but it didn't work :sob:

Or maybe I can add a.detach_() to stop the backward pass at that point; if the project then runs correctly, does that mean the variable a is the problem?

I think I have found the offending variable, but I don't know how to fix it. Could anybody help me? :sob:

def _get_parallel_step_context(self, embeddings, state, from_depot=False):
        """
        Returns the context per step, optionally for multiple steps at once (for efficient evaluation of the model)
        """

        current_node = state.get_current_node()
        batch_size, num_steps = current_node.size()

        a = torch.gather(
                  .view(batch_size, num_steps, 1)
                  .expand(batch_size, num_steps, embeddings.size(-1))
          ).view(batch_size, num_steps, embeddings.size(-1))
        # a = a.detach()
        return, self.problem.VEHICLE_CAPACITY - state.used_capacity), -1)

a is the offending variable; it is used to compute the parallel step context.

I changed the code to the following, and it worked!

a = torch.gather(
                        .view(batch_size, num_steps, 1).clone()
                        .expand(batch_size, num_steps, embeddings.size(-1))
                ).clone().view(batch_size, num_steps, embeddings.size(-1))
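
A likely explanation for why this works: `torch.gather` saves its index tensor for the backward pass, and an index built with `.view(...).expand(...)` is still a view of the state tensor, so it shares that tensor's version counter. Any later in-place update of the state then invalidates the saved index, which would match the LongTensor of shape [256, 10, 128] named in the error. Cloning the index gives it its own memory and its own counter. A minimal sketch, assuming toy shapes and a hypothetical in-place state update (none of the names below come from the project):

```python
import torch


def backward_after_state_update(clone_index):
    embeddings = torch.randn(2, 5, 4, requires_grad=True)  # (batch, nodes, dim)
    current_node = torch.zeros(2, 1, dtype=torch.long)     # (batch, num_steps)
    batch_size, num_steps = current_node.size()

    index = current_node.view(batch_size, num_steps, 1)    # view: shares the version counter
    if clone_index:
        index = index.clone()                              # own memory, own version counter
    index = index.expand(batch_size, num_steps, embeddings.size(-1))
    context = torch.gather(embeddings, 1, index)           # gather saves index for backward

    current_node[:, 0] = 3                                 # in-place state update after the gather
    try:
        context.sum().backward()
    except RuntimeError as err:
        return str(err)
    return embeddings.grad


without_clone = backward_after_state_update(clone_index=False)
with_clone = backward_after_state_update(clone_index=True)
print(without_clone)
print(with_clone.shape)
```

In this sketch only the clone on the index is needed. An alternative with the same effect is to have the state update create new tensors instead of writing into existing ones.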