I'm working on a seq2seq project, but when I run it, the error below occurs. I have read similar topics but still don't know how to fix it. I've been stuck on this for days and hope someone can help me. Thanks a lot!
The full error message is down below:
```
Traceback (most recent call last):
  File "C:/Users/cqf/Desktop/试验/run.py", line 172, in <module>
    run(get_options())
  File "C:/Users/cqf/Desktop/试验/run.py", line 158, in run
    train_epoch(
  File "C:\Users\cqf\Desktop\试验\train.py", line 85, in train_epoch
    train_batch(
  File "C:\Users\cqf\Desktop\试验\train.py", line 154, in train_batch
    loss.backward(retain_graph=True)
  File "F:\Anaconda3\envs\Mypycharm\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "F:\Anaconda3\envs\Mypycharm\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [256, 10, 128]] is at version 47; expected version 46 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
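For context on why a LongTensor shows up here at all: an integer tensor can count as "needed for gradient computation" when it is used as an index, because ops such as `torch.gather` save their index tensor for the backward pass. Below is a hypothetical, self-contained reproduction (the names `decode_step`, `cur` and `next_node` are made up, not taken from the project): an index is expanded to the embedding dimension, used in `gather`, and then its base tensor is overwritten in place. Because `expand()` returns a view that shares the base tensor's version counter, that in-place write is enough to trigger the same RuntimeError. Whether the real culprit looks like this is only a guess, since `state.update` and `_get_log_p` are not shown, but a `[256, 10, 128]` LongTensor whose last dimension matches the embedding size (assuming it is 128) would be consistent with it.

```python
import torch


def decode_step(inplace: bool) -> torch.Tensor:
    """Gather one node embedding per batch item, then advance the index.

    Returns the gradient w.r.t. the embeddings, or raises RuntimeError
    when the index is advanced in place.
    """
    batch, n_nodes, dim = 4, 5, 8
    embeddings = torch.randn(batch, n_nodes, dim, requires_grad=True)

    # Current node per batch item, shape (batch, 1): a LongTensor
    cur = torch.zeros(batch, 1, dtype=torch.long)

    # cur[:, :, None] and expand() are both views of `cur`, so they share
    # its version counter. gather saves this index tensor for backward.
    idx = cur[:, :, None].expand(batch, 1, dim)
    picked = embeddings.gather(1, idx)  # (batch, 1, dim)

    next_node = torch.ones(batch, 1, dtype=torch.long)
    if inplace:
        cur.copy_(next_node)  # bumps the shared version counter -> backward fails
    else:
        cur = next_node       # rebinds the name; the saved index is untouched

    picked.sum().backward()
    return embeddings.grad
```

`decode_step(inplace=False)` returns the gradient normally, while `decode_step(inplace=True)` raises the "modified by an inplace operation" error at `backward()`. Reassigning instead of writing in place, or building the index from a `.clone()` of the base tensor, avoids it.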
As you can see, I have already enabled anomaly detection mode, but I'm still kind of confused.
My code is down below:
```python
def train_batch(
        model,
        optimizer,
        baseline,
        epoch,
        batch_id,
        step,
        batch,
        tb_logger,
        opts
):
    x, bl_val = baseline.unwrap_batch(batch)
    x = move_to(x, opts.device)
    bl_val = move_to(bl_val, opts.device) if bl_val is not None else None

    # Evaluate model, get costs and log probabilities
    cost, log_likelihood = model(x)

    # Evaluate baseline, get baseline loss if any (only for critic)
    bl_val, bl_loss = baseline.eval(x, cost) if bl_val is None else (bl_val, 0)

    # Calculate loss
    reinforce_loss = ((cost - bl_val) * log_likelihood).mean()
    loss = reinforce_loss + bl_loss

    # Perform backward pass and optimization step
    optimizer.zero_grad()
    with torch.autograd.set_detect_anomaly(True):
        loss.backward(retain_graph=True)
    # Clip gradient norms and get (clipped) gradient norms for logging
    grad_norms = clip_grad_norms(optimizer.param_groups, opts.max_grad_norm)
    optimizer.step()
```
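In the `train_batch` above, `set_detect_anomaly(True)` only wraps `loss.backward()`. Anomaly detection records the forward-pass traceback only for operations that run while it is enabled, so with this placement PyTorch can say which backward node failed but not where in `forward` or `_inner` that node was created. A minimal sketch of wrapping both passes, using a made-up `ToyModel` with a deliberate in-place bug and a hypothetical `train_step` helper in place of the real model and training loop:

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for the real model: its forward does an
    in-place op on a tensor that autograd has saved."""

    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(3, 3)

    def forward(self, x):
        h = self.lin(x)
        y = h.sigmoid()  # sigmoid saves its output y for backward
        y.mul_(2)        # in-place change to y -> backward will fail
        return y.sum()


def train_step(model, x):
    # Enable anomaly detection for BOTH forward and backward so autograd
    # records where each backward node was created in the forward pass.
    with torch.autograd.set_detect_anomaly(True):
        loss = model(x)
        loss.backward()
```

With the forward pass inside the context manager, the warning printed just before the RuntimeError should include a traceback of the forward call that created the failing node (here, the `sigmoid` line whose saved output is overwritten by `mul_`). Anomaly detection slows things down noticeably, so it is best enabled only for a debugging run.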
The forward function is down below:
```python
def forward(self, input, return_pi=False):
    """
    :param input: (batch_size, graph_size, node_dim) input node features or dictionary with multiple tensors
    :param return_pi: whether to return the output sequences, this is optional as it is not compatible with
    using DataParallel as the results may be of different lengths on different GPUs
    :return:
    """

    if self.checkpoint_encoder and self.training:  # Only checkpoint if we need gradients
        embeddings, _ = checkpoint(self.embedder, self._init_embed(input))
    else:
        embeddings, _ = self.embedder(self._init_embed(input))

    _log_p, pi, cost = self._inner(input, embeddings)

    init_lengths, mask = self.problem.get_costs(input, pi)
    final_lengths = cost + init_lengths[:, None]
    # Log likelihood is calculated within the model since returning it per action does not work well with
    # DataParallel since sequences can be of different lengths
    ll = self._calc_log_likelihood(_log_p, pi, mask)
    if return_pi:
        return final_lengths.squeeze(), ll, pi

    return final_lengths.squeeze(), ll
```
The `_inner` function, which performs the decoding, is down below:
```python
def _inner(self, input, embeddings):

    outputs = []
    sequences = []

    state = self.problem.make_state(input)

    # Compute keys, values for the glimpse and keys for the logits once as they can be reused in every step
    fixed = self._precompute(embeddings)

    batch_size = state.ids.size(0)

    # Perform decoding steps
    i = 0
    while not (self.shrink_size is None and state.all_finished()):

        if self.shrink_size is not None:
            unfinished = torch.nonzero(state.get_finished() == 0)
            if len(unfinished) == 0:
                break
            unfinished = unfinished[:, 0]
            # Check if we can shrink by at least shrink_size and if this leaves at least 16
            # (otherwise batch norm will not work well and it is inefficient anyway)
            if 16 <= len(unfinished) <= state.ids.size(0) - self.shrink_size:
                # Filter states
                state = state[unfinished]
                fixed = fixed[unfinished]

        log_p, mask = self._get_log_p(fixed, state)

        # Select the indices of the next nodes in the sequences, result (batch_size) long
        selected = self._select_node(log_p.exp()[:, 0, :], mask[:, 0, :])  # Squeeze out steps dimension

        state = state.update(selected, i)

        # Now make log_p, selected desired output size by 'unshrinking'
        if self.shrink_size is not None and state.ids.size(0) < batch_size:
            log_p_, selected_ = log_p, selected
            log_p = log_p_.new_zeros(batch_size, *log_p_.size()[1:])
            selected = selected_.new_zeros(batch_size)

            log_p[state.ids[:, 0]] = log_p_
            selected[state.ids[:, 0]] = selected_

        # Collect output of step
        outputs.append(log_p[:, 0, :])
        sequences.append(selected)

        i = i + 1

    lengths = state.lengths + state.get_final_cost()

    # Collected lists, return Tensor
    return torch.stack(outputs, 1), torch.stack(sequences, 1), lengths
```
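The unshrinking step in `_inner` writes into `log_p` and `selected` with index assignment, which is an in-place operation. Those writes go into freshly created zero tensors, and neither tensor has the `[256, 10, 128]` shape from the error, so they are probably not the culprit. Still, one way to rule them out is an out-of-place equivalent based on `Tensor.index_copy` (no trailing underscore), which returns a new tensor instead of modifying its input. The `unshrink` helper is hypothetical; its `ids` argument plays the role of `state.ids[:, 0]`.

```python
import torch


def unshrink(log_p_, selected_, ids, batch_size):
    """Out-of-place version of the 'unshrinking' step in _inner.

    log_p_:    (n_active, steps, n_nodes) float tensor for the active rows
    selected_: (n_active,) long tensor of chosen nodes
    ids:       (n_active,) long tensor, row of each active item in the full batch
    """
    log_p = log_p_.new_zeros(batch_size, *log_p_.size()[1:])
    selected = selected_.new_zeros(batch_size)
    # index_copy returns a new tensor with rows `ids` replaced by the source
    # rows, so neither log_p nor selected is modified in place.
    log_p = log_p.index_copy(0, ids, log_p_)
    selected = selected.index_copy(0, ids, selected_)
    return log_p, selected
```

Rows not listed in `ids` stay zero, matching the behavior of the index assignment it replaces, and gradients still flow back to `log_p_`.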
P.S. The project is fairly large, so I don't know how to simplify it. If you need the whole code, please contact me!
If you can't see where the problem is, telling me how to find the variable in question would also be helpful. One more thing confuses me: what do "version 47" and "version 46" in the error mean?
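On the version numbers: every tensor carries an internal version counter (readable in Python through the private attribute `Tensor._version`) that increases each time the tensor, or any view of it, is modified in place. When autograd saves a tensor for backward it records the counter's current value, and at backward time it checks that the value has not changed. "is at version 47; expected version 46" therefore means the LongTensor was saved when its counter was 46 and was modified in place once more before `backward()` ran; the absolute numbers only reflect how many in-place writes it had received overall. To narrow down which tensor it is, one option is to snapshot the counters of candidate tensors before a suspect call (for example `state.update`) and compare afterwards. The helpers below are hypothetical, and since `_version` is internal, its behavior could change between PyTorch releases.

```python
import torch


def snapshot_versions(named_tensors):
    """Record the current version counter of each tensor in a dict."""
    return {name: t._version for name, t in named_tensors.items()}


def changed_since(before, named_tensors):
    """Names of tensors whose version counter moved since `before`."""
    return [name for name, t in named_tensors.items()
            if t._version != before[name]]


if __name__ == "__main__":
    a = torch.zeros(3, dtype=torch.long)
    b = torch.zeros(3, dtype=torch.long)
    tensors = {"a": a, "b": b}
    before = snapshot_versions(tensors)
    a.add_(1)   # in-place: bumps a._version by one
    b = b + 1   # out-of-place: new tensor, the original b is untouched
    print(changed_since(before, tensors))  # ['a']
```

In the real code, the dict would hold whichever state tensors have dtype long and shape `[256, 10, 128]`; whatever `changed_since` reports after a decoding step is a candidate for the in-place write.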