Getting RuntimeError when running the parameter server tutorial

Hi there,

I’m trying to run this tutorial locally for one parameter server and two workers.

The problem is that I'm getting the error below:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "rpc_parameter_server.py", line 228, in run_worker
    run_training_loop(rank, num_gpus, train_loader, test_loader)
  File "rpc_parameter_server.py", line 187, in run_training_loop
    dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [32, 1, 3, 3]] is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
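
I haven't dug into the hint yet, but in case it helps anyone reproduce this, I believe enabling anomaly detection would look something like this near the top of run_worker, before the training loop starts:

import torch

# Per the error's hint: record forward-op stack traces so the backward
# pass can report which op produced the tensor that was later modified
# in place.
torch.autograd.set_detect_anomaly(True)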

Here are my torch versions, in case they're needed:
pip3 freeze | grep torch
torch==1.5.1+cpu
torchtext==0.6.0
torchvision==0.6.1+cpu

Thanks in advance for any advice!

Hey @rvarm1, I wonder if we need a lock in ParameterServer.forward? Otherwise, if the execution of forward gets sliced into multiple pieces, interleaved execution from different RPC threads could corrupt the autograd graph state.
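
Something like this is what I have in mind; just a sketch against the tutorial's ParameterServer (the tiny model below is a stand-in, not the tutorial's actual net):

import threading
import torch.nn as nn

class ParameterServer(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the tutorial's network.
        self.model = nn.Sequential(nn.Conv2d(1, 32, 3), nn.ReLU())
        # Single lock shared by every RPC thread hitting this server.
        self.lock = threading.Lock()

    def forward(self, inp):
        # Serialize forward passes: without this, two RPC threads can
        # interleave inside forward and mutate shared autograd state
        # (parameter versions) mid-pass.
        with self.lock:
            return self.model(inp)

With the lock held for the whole forward, each RPC call sees a consistent view of the parameters, at the cost of serializing requests on the server.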