When I use two GPUs to train my model, I get the RuntimeError below:
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 190, in run
    main(rank, dev_id, args)
  File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 149, in main
    train(args['gnn'], model, device, train_loader, criterion, optimizer, args['num_devices'], rank)
  File "/home/ubuntu/ogb/ogb/graphproppred/m.py", line 41, in train
    optimizer.backward_and_step(loss)
  File "/home/ubuntu/ogb/ogb/graphproppred/utils.py", line 146, in backward_and_step
    self._sync_gradient()
  File "/home/ubuntu/ogb/ogb/graphproppred/utils.py", line 127, in _sync_gradient
    dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Stop_waiting response is expected
Process SpawnProcess-1:
(identical traceback to the one above, ending in the same RuntimeError: Stop_waiting response is expected)
Here is the code where the error occurred:
def _sync_gradient(self):
    """Average gradients across all subprocesses."""
    for param_group in self.optimizer.param_groups:
        for p in param_group['params']:
            if p.requires_grad and p.grad is not None:
                # print(p.grad.data.shape, p.grad.data.device)
                dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
                p.grad.data /= self.n_processes
P.S. When I add print(p.grad.data.shape, p.grad.data.device) before the all_reduce call, the gradients look normal and have the same shape [1, 300] on the two GPUs, so I'm confused about why it stops here.
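
To try to narrow this down, here is a minimal standalone sanity check I can run to see whether dist.all_reduce works at all on this machine, outside the training code. This is only a sketch: it uses the "gloo" backend on CPU so it runs without GPUs, and the master address/port and world size are arbitrary choices I made for the test (my real training uses two GPU processes):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2  # arbitrary: mirrors the two training processes

def _worker(rank, world_size):
    # Arbitrary rendezvous settings for this local test.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # Each rank contributes a [1, 300] tensor filled with (rank + 1),
    # mirroring the gradient shape from the print statement above.
    # After the SUM reduction every rank should hold 1 + 2 = 3,
    # and 1.5 after dividing by world_size, as in _sync_gradient.
    t = torch.full((1, 300), float(rank + 1))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= world_size
    assert torch.allclose(t, torch.full_like(t, 1.5)), t[0, 0].item()
    dist.destroy_process_group()

def run_sanity_check(world_size=WORLD_SIZE):
    # mp.spawn re-raises any exception from a child process, so
    # returning here means all_reduce succeeded on every rank.
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
    return True

if __name__ == "__main__":
    print("all_reduce sanity check passed:", run_sanity_check())
```

If this passes but the training still fails, that would suggest the problem is in the nccl/GPU setup or the process-group initialization of the training script rather than in all_reduce itself.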