We are trying to use pipelined model parallelism to speed up model training. This requires us to manage multiple versions of the weights. It worked in PyTorch 1.4, but from 1.5 onward the following error is reported.
In our scenario we have confirmed that the weight versions are consistent, so is there any high-level API to disable autograd's version check on the weights?
Traceback (most recent call last):
  File "bert_main.py", line 294, in <module>
    main()
  File "bert_main.py", line 210, in main
    model_args.model_name_or_path) else None
  File "/workspace/pipedream-1.5BW/lib/runtimecontrol.py", line 521, in train
    self.runtime_control.run_forward_backward()
  File "/workspace/pipedream-1.5BW/lib/runtimecontrol.py", line 194, in run_forward_backward
    data_provider=data_provider)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 157, in backward
    result = self._backward(**kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 317, in _backward
    result = self._backward_g0(**kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 392, in _backward_g0
    return self._comm_backward(outputs, kwargs)
  File "/workspace/pipedream-1.5BW/lib/callableunit.py", line 174, in _comm_backward
    torch.autograd.backward(forward_output, backward_gradient)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 146, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor ] is at version 9; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
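As far as I know there is no public switch to turn off the version-counter check itself. One classic (and deliberately unsafe) bypass is to apply the in-place update through `tensor.data`: the `.data` alias carries a fresh version counter, so autograd does not see the modification, whereas `tensor.detach()` shares the counter and does not bypass the check. The sketch below is a minimal standalone reproduction plus that workaround, not PipeDream's actual code; correctness then rests entirely on your guarantee that the stored values are the ones backward should see (as with weight stashing).

```python
import torch

# Minimal reproduction: an in-place update between forward and backward
# bumps the tensor's version counter, and backward detects the mismatch.
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()      # mul saves w (at its current version) for backward
with torch.no_grad():
    w.add_(1.0)           # in-place op: version counter is incremented
try:
    loss.backward()       # saved version != current version -> RuntimeError
except RuntimeError as e:
    print("version check failed:", e)

# Workaround sketch: updating through `.data` modifies the same storage,
# but the `.data` alias has its own version counter, so the check passes.
# This silently makes YOU responsible for gradient correctness.
w2 = torch.ones(3, requires_grad=True)
loss2 = (w2 * w2).sum()
w2.data.add_(1.0)         # bypasses the version counter (unsafe in general)
loss2.backward()          # succeeds; the gradient is computed from the
print(w2.grad)            # updated storage, i.e. 2 * current w2 values
```

Note that the gradient in the second case is taken at the *updated* weights, since the saved tensor shares storage with `w2`; in a weight-stashing scheme you would restore the matching weight version before calling backward.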