I’m trying to run this tutorial locally with one parameter server and two workers.
The problem is that I’m getting the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "rpc_parameter_server.py", line 228, in run_worker
run_training_loop(rank, num_gpus, train_loader, test_loader)
File "rpc_parameter_server.py", line 187, in run_training_loop
dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [32, 1, 3, 3]] is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Here’s my torch version if needed:
pip3 freeze | grep torch
torch==1.5.1+cpu
torchtext==0.6.0
torchvision==0.6.1+cpu
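Following the hint in the traceback, here’s a minimal local sketch (no RPC involved, tensor names made up) of how this class of error arises and how anomaly detection helps locate it:

```python
import torch

# Minimal local reproduction of the same class of error: a tensor that
# autograd saved for backward is modified in place, bumping its version
# counter past what backward expects.
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()      # mul saves w for its backward pass
with torch.no_grad():
    w.add_(1.0)           # in-place update: w's version counter advances

# With anomaly detection enabled, the traceback also points at the
# forward-pass op (the mul) whose saved tensor was modified, not just
# the backward call that happened to trip over it.
with torch.autograd.set_detect_anomaly(True):
    try:
        loss.backward()
    except RuntimeError as e:
        # message contains "modified by an inplace operation"
        print("inplace" in str(e))
```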
Hey @rvarm1, I wonder if we need a lock in ParameterServer.forward. Otherwise, if the execution of forward gets sliced into multiple pieces, interleaved execution from different RPC threads could corrupt the autograd graph state.
I’m getting this, too, but interestingly only when I have > 1 worker node.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "rpc_parameter_server.py", line 224, in run_worker
run_training_loop(rank, num_gpus, train_loader, test_loader)
File "rpc_parameter_server.py", line 183, in run_training_loop
dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDAFloatType [128, 10]], which is output 0 of TBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I did some looking into this. Adding a lock on ParameterServer.forward has no effect.
def forward(self, inp):
    # forward_lock defined globally
    with forward_lock:
        inp = inp.to(self.input_device)
        out = self.model(inp)
        # This output is forwarded over RPC, which as of 1.5.0 only accepts
        # CPU tensors. Tensors must be moved in and out of GPU memory due to this.
        out = out.to("cpu")
        return out
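In case it helps narrow things down, here’s a single-process sketch of one hypothesis (no RPC, just my guess, not a confirmed diagnosis): the in-place op is the optimizer’s parameter update on the server, so a lock around forward alone can’t stop another worker’s step() from landing between this worker’s forward and backward.

```python
import torch
import torch.nn as nn

# Two "workers" sharing one model, mimicking the parameter server.
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Populate gradients so step() actually mutates the weights in place.
model(torch.randn(8, 4)).pow(2).sum().backward()

x = torch.randn(8, 4, requires_grad=True)
loss = model(x).pow(2).sum()  # forward: autograd saves the current weights
opt.step()                    # "other worker's" update bumps the weight version
try:
    loss.backward()           # same RuntimeError as in the tutorial
except RuntimeError as e:
    print("inplace" in str(e))
```

If this is the mechanism, the lock would have to cover the whole forward/backward/step sequence for a given worker, not just forward.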