MPI does not gracefully terminate

I have been following the code and tutorial on using PyTorch for distributed machine learning here. I am able to run the code (and it completes all the tasks), but my program does not terminate and I have to kill it manually with Ctrl+C. The exact code is here

Right now, after completing the task, it hangs after displaying the following warning messages:

/anaconda3/lib/python3.7/site-packages/torch/distributed/ UserWarning: For MPI backend, world_size (0) and rank (0) are ignored since they are assigned by the MPI runtime.
  "MPI runtime.".format(world_size, rank)) UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  return F.log_softmax(x)
/anaconda3/lib/python3.7/site-packages/torch/distributed/ UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "

I run the code using this basic command:

 /anaconda3/bin/mpirun -np 3 --host node-0,node-1,node-2 python

Do I need to add something in the code to exit gracefully?

This shouldn’t be necessary. Which MPI implementation are you using?

I tried to run the code with the gloo backend to check whether this is an MPI-only problem, but it initially hit the following error:

    self._target(*self._args, **self._kwargs)
  File "", line 132, in init_processes
    fn(rank, size)
  File "", line 103, in run
    train_set, bsz = partition_dataset()
ValueError: batch_size should be a positive integer value, but got batch_size=64.0
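A value like 64.0 is what Python 3 produces when a batch size is computed with true division (`/`), which always returns a float. Here is a minimal sketch of the usual fix, assuming the per-process batch size comes from dividing a global batch size by the number of processes. The names `global_batch_size` and `world_size` and the values 128 and 2 are hypothetical, not taken from the tutorial code.

```python
def per_process_batch_size(global_batch_size, world_size):
    """Split a global batch size across processes, returning an int."""
    # Floor division keeps the result an integer; wrapping a true
    # division in int(...) would work as well.
    return global_batch_size // world_size


# Hypothetical values chosen to reproduce the 64.0 seen in the error.
global_batch_size = 128
world_size = 2

float_bsz = global_batch_size / float(world_size)  # true division: a float
int_bsz = per_process_batch_size(global_batch_size, world_size)

print(type(float_bsz).__name__, float_bsz)
print(type(int_bsz).__name__, int_bsz)
```

The integer result is what should be passed as `batch_size`.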

After fixing that, it hits the error below:

  File "", line 132, in init_processes
    fn(rank, size)
  File "", line 118, in run
    epoch_loss +=[0]
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number
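As the message suggests, a 0-dim tensor cannot be indexed with `[0]`; `.item()` converts it to a plain Python number instead. A small sketch of that change follows, using a hand-made scalar tensor as a stand-in for the loss (the value 0.5 and the variable names are illustrative only, not the tutorial's code).

```python
import torch

# A 0-dim tensor, standing in for the scalar loss returned by a loss function.
loss = torch.tensor(0.5)

# Indexing a 0-dim tensor raises the IndexError shown above.
try:
    loss[0]
    indexing_failed = False
except IndexError:
    indexing_failed = True

# .item() extracts the value as a Python float, which can be accumulated.
epoch_loss = 0.0
epoch_loss += loss.item()

print(indexing_failed, epoch_loss)
```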

Can we have a minimal example that reproduces the hang? Thanks!

I figured out the issue. The MPI part was stuck because one of the processes was waiting for another process to send some data.
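To illustrate this kind of hang without an MPI setup, here is a rough analogy using only standard-library threads and a queue. It is not the MPI or torch.distributed API. The receiver uses a timeout so the example terminates; a blocking receive with no timeout and no matching send would simply wait forever, which is what the stuck process was doing. All names here are hypothetical.

```python
import queue
import threading


def receiver(channel, results, timeout):
    """Wait for one message; record None if nothing arrives in time."""
    try:
        results.append(channel.get(timeout=timeout))
    except queue.Empty:
        results.append(None)


def run(sender_sends, timeout=0.2):
    """Start a receiver and optionally send it one message."""
    channel = queue.Queue()
    results = []
    worker = threading.Thread(target=receiver, args=(channel, results, timeout))
    worker.start()
    if sender_sends:
        channel.put("data")  # the matching send
    worker.join()
    return results[0]


matched = run(sender_sends=True)     # receive completes
unmatched = run(sender_sends=False)  # receive gives up after the timeout
print(matched, unmatched)
```

The takeaway is that every receive needs a matching send on the other side before the program can exit cleanly.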

Yeah, I resolved those issues by converting batch_size to an integer and[0] to