We are trying to use pipelined model parallelism to speed up model training. This requires us to manage multiple versions of the weights. It worked in PyTorch 1.4, but from 1.5 onward the following error is reported.
In our scenario we have confirmed that the weight versions are consistent, so is there any high-level API to disable autograd's version check on the weights?
Traceback (most recent call last):
  File "bert_main.py", line 294, in <module>
    main()
  File "bert_main.py", line 210, in main
    model_args.model_name_or_path) else None
  File "/workspace/pipedream-1.5BW/lib/runtimecontrol.py", line 521, in train
    self.runtime_control.run_forward_backward()
  File "/workspace/pipedream-1.5BW/lib/runtimecontrol.py", line 194, in run_forward_backward
    data_provider=data_provider)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 157, in backward
    result = self._backward(**kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 317, in _backward
    result = self._backward_g0(**kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 392, in _backward_g0
    return self._comm_backward(outputs, kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 174, in _comm_backward
    torch.autograd.backward(forward_output, backward_gradient)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 146, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor ] is at version 9; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
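As far as I know there is no public switch to turn off the version-counter check itself. One classic (and deliberately unsafe) bypass is to apply the in-place update through `tensor.data`: the `.data` alias carries a fresh version counter, so autograd does not see the modification, whereas `tensor.detach()` shares the counter and does not bypass the check. The sketch below is a minimal standalone reproduction plus that workaround, not PipeDream's actual code; correctness then rests entirely on your guarantee that the stored values are the ones backward should see (as with weight stashing).

```python
import torch

# Minimal reproduction: an in-place update between forward and backward
# bumps the tensor's version counter, and backward detects the mismatch.
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()      # mul saves w (at its current version) for backward
with torch.no_grad():
    w.add_(1.0)           # in-place op: version counter is incremented
try:
    loss.backward()       # saved version != current version -> RuntimeError
except RuntimeError as e:
    print("version check failed:", e)

# Workaround sketch: updating through `.data` modifies the same storage,
# but the `.data` alias has its own version counter, so the check passes.
# This silently makes YOU responsible for gradient correctness.
w2 = torch.ones(3, requires_grad=True)
loss2 = (w2 * w2).sum()
w2.data.add_(1.0)         # bypasses the version counter (unsafe in general)
loss2.backward()          # succeeds; the gradient is computed from the
print(w2.grad)            # updated storage, i.e. 2 * current w2 values
```

Note that the gradient in the second case is taken at the *updated* weights, since the saved tensor shares storage with `w2`; in a weight-stashing scheme you would restore the matching weight version before calling backward.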