RuntimeError: Stop_waiting response is expected

When I use two GPUs to train my model, I get the RuntimeError below:

Process SpawnProcess-2:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 190, in run
main(rank, dev_id, args)
File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 149, in main
train(args['gnn'], model, device, train_loader, criterion, optimizer, args['num_devices'], rank)
File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 41, in train
optimizer.backward_and_step(loss)
File "/home/ubuntu/ogb/ogb/graphproppred/utils.py", line 146, in backward_and_step
self._sync_gradient()
File "/home/ubuntu/ogb/ogb/graphproppred/utils.py", line 127, in _sync_gradient
dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: Stop_waiting response is expected

Process SpawnProcess-1 fails with an identical traceback, ending in the same error:

RuntimeError: Stop_waiting response is expected

Here is the code where the error occurred:

def _sync_gradient(self):
    """Average gradients across all subprocesses."""
    for param_group in self.optimizer.param_groups:
        for p in param_group['params']:
            if p.requires_grad and p.grad is not None:
                # print(p.grad.data.shape, p.grad.data.device)
                dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
                p.grad.data /= self.n_processes

P.S. When I add print(p.grad.data.shape, p.grad.data.device), the gradients look normal and have the same shape [1, 300] on the two different GPUs, so I'm confused about why it stops here.

Is the result of p.requires_grad and p.grad is not None always the same across all processes and all parameters? If not, the allreduce ops on different processes could run into a desync.
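
A minimal sketch of one way to check this, assuming the optimizer has the same param-group structure on every rank; the helper name check_grad_sync_consistency is hypothetical, not part of the code above:

import torch
import torch.distributed as dist

def check_grad_sync_consistency(optimizer, rank):
    """Hypothetical check: every rank should all_reduce the same number of gradients."""
    # Count the parameters this rank would include in the gradient sync.
    local_count = sum(
        1
        for group in optimizer.param_groups
        for p in group['params']
        if p.requires_grad and p.grad is not None
    )
    # Sum the counts across ranks; if every rank agrees, the total is world_size * local_count.
    count = torch.tensor([local_count], device=f'cuda:{rank}')
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    if count.item() != local_count * dist.get_world_size():
        raise RuntimeError(
            f"rank {rank}: gradient count mismatch (local {local_count}, "
            f"global sum {count.item()}); the per-parameter allreduce would desync"
        )

Calling this right before the per-parameter all_reduce loop would rule out a mismatch in which parameters actually have gradients on each rank.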

Which backend are you using (NCCL/Gloo/MPI), and which PyTorch version? It would be helpful to have a minimal repro of this error.


Thanks for your reply!
Here is a shortened version of my code:

import torch
import torch.distributed as dist
import torch.optim as optim

def main(rank, dev_id, args):
    torch.distributed.init_process_group(backend="nccl",
                                         init_method='tcp://localhost:22',
                                         world_size=args['num_devices'],
                                         rank=dev_id)
    model = mymodel.to(dev_id)
    optimizer = optim.Adam(model.parameters(), lr=args['lr'])
    for epoch in range(num_epochs):
        pred = model(inputs)
        loss = criterion(pred, label)
        optimizer.zero_grad()
        loss.backward()

        # Manually average gradients across processes.
        for param_group in optimizer.param_groups:
            for p in param_group['params']:
                if p.requires_grad and p.grad is not None:
                    # print(p.grad.data.shape, p.grad.data.device)  # P.S. grad information is available here
                    dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
                    p.grad.data /= n_processes
        optimizer.step()
    torch.distributed.barrier()

mp = torch.multiprocessing.get_context('spawn')
procs = []
for proc_id, device_id in enumerate(devices):
    procs.append(mp.Process(target=main, args=(proc_id, device_id, args), daemon=True))
    procs[-1].start()
for p in procs:
    p.join()

The PyTorch version I'm using is 1.4.0, and all of my code runs on an AWS instance.
P.S. The port in init_method can only be 22, which is strange; with any other port I get a RuntimeError like this:

File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 120, in _tcp_rendezvous_handler
store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: connect() timed out.
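
(For context, port 22 is normally reserved for SSH, so the TCPStore rendezvous would be contending with sshd there. Below is a minimal sketch of picking an unused, non-privileged port on a single node instead; find_free_port is a hypothetical helper, not part of the script above.)

import socket

def find_free_port():
    """Hypothetical helper: ask the OS for an unused TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('localhost', 0))  # port 0 makes the OS pick a free port
        return s.getsockname()[1]

# Every process must rendezvous on the same port, so choose it once in the
# parent process and pass it to the workers, e.g.:
# port = find_free_port()
# init_method = 'tcp://localhost:%d' % port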

Thanks for your time and reading!

The error has been fixed.
The 'Stop_waiting response is expected' error is raised in TCPStore.cpp, so it was actually a communication problem. It finally worked after I reinstalled NCCL: https://github.com/NVIDIA/nccl.git
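
As a quick way to confirm that the NCCL backend PyTorch sees is usable after the reinstall (a sketch, assuming a CUDA build of PyTorch):

import torch
import torch.cuda.nccl
import torch.distributed as dist

print(torch.__version__)              # e.g. 1.4.0
print(torch.cuda.is_available())      # True if the CUDA devices are visible
print(dist.is_nccl_available())       # True if the NCCL backend was built in
print(torch.cuda.nccl.version())      # the NCCL version that PyTorch reports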