One of the variables needed for gradient computation has been modified by an inplace operation in CrossEntropyLoss()

In PyTorch 1.6, with torch.autograd.set_detect_anomaly(True) enabled, the anomaly detection points at torch.nn.CrossEntropyLoss():

[W python_anomaly_mode.cpp:60] Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/workers/default_worker.py", line 124, in
(pid=23381) ray.worker.global_worker.main_loop()
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/worker.py", line 421, in main_loop
(pid=23381) self.core_worker.run_task_loop()
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/function_manager.py", line 559, in actor_method_executor
(pid=23381) method_returns = method(actor, *args, **kwargs)
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/tune/trainable.py", line 261, in train
(pid=23381) result = self._train()
(pid=23381) File "/mnt/new_hdd5/sivagami/train_conv/train_conv/trainable_cls.py", line 638, in _train
(pid=23381) vector_dict = self.net(data, out_layers=self.out_layers) # Forward Propagation
(pid=23381) File "/mnt/new_hdd5/sivagami/train_conv/train_conv/architecture/BaseNet.py", line 101, in call
(pid=23381) store = module(store)
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 723, in _call_impl
(pid=23381) result = self.forward(*input, **kwargs)
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 93, in forward
(pid=23381) return F.linear(input, self.weight, self.bias)
(pid=23381) File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1674, in linear
(pid=23381) ret = torch.addmm(bias, input, weight.t())
(pid=23381) (function print_stack)
File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TrainConvTrainableMLNT.train() (pid=23381, ip=192.168.0.24)
File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task.function_executor
File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/ray/tune/trainable.py", line 261, in train
result = self._train()
File "/mnt/new_hdd5/sivagami/train_conv/train_conv/trainable_cls.py", line 691, in _train
grads = torch.autograd.grad(fast_loss, self.net.parameters(), create_graph=True, retain_graph=True, only_inputs=True)
File "/home/cv/.virtualenvs/trainconv_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 192, in grad
inputs, allow_unused)
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [512, 5]], which is output 0 of TBackward, is at version 19; expected version 17 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
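
For reference, anomaly detection was enabled around the forward pass and the autograd.grad call, roughly as sketched below; net, criterion, data, and target are placeholder names, not the actual objects from trainable_cls.py:

import torch
import torch.nn as nn

# Minimal sketch only; net, criterion, data and target stand in for the real
# model, loss, and batch used in _train().
net = nn.Linear(512, 5)
criterion = nn.CrossEntropyLoss()
data = torch.randn(8, 512)
target = torch.randint(0, 5, (8,))

with torch.autograd.set_detect_anomaly(True):
    output = net(data)
    loss = criterion(output, target)
    grads = torch.autograd.grad(loss, net.parameters(), create_graph=True, retain_graph=True)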

Any solution for this?

Without seeing the code I would recommend removing obvious inplace operations (e.g. via += etc. or via the inplace version of some operations such as tensor.add_()).
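
For example (a standalone sketch, not taken from the code in question), an inplace update on a tensor that autograd has saved for the backward pass reproduces the error, while the out-of-place version works:

import torch

x = torch.randn(4, 5, requires_grad=True)
y = torch.sigmoid(x)      # sigmoid saves its output y for the backward pass

# Inplace updates such as the following bump y's version counter and raise the
# "modified by an inplace operation" error once backward()/autograd.grad() runs:
#   y += 1
#   y.add_(1)

# Out-of-place equivalents allocate a new tensor and leave the saved output intact:
y = y + 1
y = y.add(1)

y.sum().backward()        # succeeds; x.grad is populated
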
If that doesn't help, you could add .clone() calls to the returned tensors to further isolate which operation is causing the inplace manipulation and thus the error.
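
A sketch of that second step (Block is a made-up module, not part of the model above): returning a clone decouples the stored activation from later inplace changes, and moving the .clone() from layer to layer until the error disappears narrows down the offending operation:

import torch.nn as nn

class Block(nn.Module):
    # Made-up module, used only to show where the clone goes.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 5)

    def forward(self, x):
        out = self.fc(x)
        # Return a clone instead of out itself: if the error disappears, the
        # tensor returned here was the one being modified inplace later on.
        return out.clone()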