I’m trying to run this tutorial locally with one parameter server and two workers.
The problem is that I’m getting the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "rpc_parameter_server.py", line 228, in run_worker
run_training_loop(rank, num_gpus, train_loader, test_loader)
File "rpc_parameter_server.py", line 187, in run_training_loop
dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [32, 1, 3, 3]] is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Here’s my torch version if needed:
pip3 freeze | grep torch
torch==1.5.1+cpu
torchtext==0.6.0
torchvision==0.6.1+cpu
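Following the hint in the traceback, here’s a minimal local sketch (no RPC involved, tensor names made up) of how this class of error arises and how anomaly detection helps locate it:

```python
import torch

# Minimal local reproduction of the same class of error: a tensor that
# autograd saved for backward is modified in place, bumping its version
# counter past what backward expects.
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()      # mul saves w for its backward pass
with torch.no_grad():
    w.add_(1.0)           # in-place update: w's version counter advances

# With anomaly detection enabled, the traceback also points at the
# forward-pass op (the mul) whose saved tensor was modified, not just
# the backward call that happened to trip over it.
with torch.autograd.set_detect_anomaly(True):
    try:
        loss.backward()
    except RuntimeError as e:
        # message contains "modified by an inplace operation"
        print("inplace" in str(e))
```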
Hey @rvarm1, I wonder if we need a lock in ParameterServer.forward. Otherwise, if the execution of forward gets sliced into multiple pieces, interleaved execution from different RPC threads could corrupt the autograd graph state.
I’m getting this, too, but interestingly only when I have > 1 worker node.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "rpc_parameter_server.py", line 224, in run_worker
run_training_loop(rank, num_gpus, train_loader, test_loader)
File "rpc_parameter_server.py", line 183, in run_training_loop
dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDAFloatType [128, 10]], which is output 0 of TBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I did some looking into this. Adding a lock on ParameterServer.forward has no effect.
def forward(self, inp):
    # forward_lock defined globally
    with forward_lock:
        inp = inp.to(self.input_device)
        out = self.model(inp)
        # This output is forwarded over RPC, which as of 1.5.0 only accepts
        # CPU tensors. Tensors must be moved in and out of GPU memory due to this.
        out = out.to("cpu")
        return out
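In case it helps narrow things down, here’s a single-process sketch of one hypothesis (no RPC, just my guess, not a confirmed diagnosis): the in-place op is the optimizer’s parameter update on the server, so a lock around forward alone can’t stop another worker’s step() from landing between this worker’s forward and backward.

```python
import torch
import torch.nn as nn

# Two "workers" sharing one model, mimicking the parameter server.
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Populate gradients so step() actually mutates the weights in place.
model(torch.randn(8, 4)).pow(2).sum().backward()

x = torch.randn(8, 4, requires_grad=True)
loss = model(x).pow(2).sum()  # forward: autograd saves the current weights
opt.step()                    # "other worker's" update bumps the weight version
try:
    loss.backward()           # same RuntimeError as in the tutorial
except RuntimeError as e:
    print("inplace" in str(e))
```

If this is the mechanism, the lock would have to cover the whole forward/backward/step sequence for a given worker, not just forward.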