MPI does not gracefully terminate

Archie_Nidhi · July 17, 2020, 11:02pm

So I have been following the code and tutorial on using PyTorch for distributed machine learning here. I am able to run the code (and it completes all the tasks), but my program does not terminate and I need to kill it manually with Ctrl+C. The exact code is here.

Right now, after completing the task, it hangs after displaying the following warning messages:

/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:406: UserWarning: For MPI backend, world_size (0) and rank (0) are ignored since they are assigned by the MPI runtime.
  "MPI runtime.".format(world_size, rank))
train_dist.py:72: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  return F.log_softmax(x)
/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:125: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "

I run the code using this basic command:

 /anaconda3/bin/mpirun -np 3 --host node-0,node-1,node-2 python train_dist.py

Do I need to add something in the code to exit gracefully?

mrshenli · July 18, 2020, 4:01pm

This shouldn’t be necessary. Which MPI implementation are you using?

mrshenli · July 18, 2020, 4:07pm

I tried running the code with the gloo backend to check whether this is an MPI-only problem, but it initially hit the following error:

...
    self._target(*self._args, **self._kwargs)
  File "test.py", line 132, in init_processes
    fn(rank, size)
  File "test.py", line 103, in run
    train_set, bsz = partition_dataset()
  ...
ValueError: batch_size should be a positive integer value, but got batch_size=64.0

After fixing that, it hit the error below:

  File "test.py", line 132, in init_processes
    fn(rank, size)
  File "test.py", line 118, in run
    epoch_loss += loss.data[0]
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

Can we have a minimal example that reproduces the hang? Thanks!

Archie_Nidhi · July 19, 2020, 2:59am

I figured out the issue. The MPI part was stuck because one of the processes was waiting for another process to send some data.
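For anyone hitting a similar hang, here is a rough analogy of the failure mode using only the Python standard library (threads and queues, not MPI or torch.distributed). The worker names, message, and timeout are made up for illustration. One worker gets a matching send and the other does not; the timeout is what turns the silent hang into a visible result.

```python
import queue
import threading


def worker(inbox, results, name, timeout):
    """Wait for one message; record the outcome instead of blocking forever."""
    try:
        results[name] = inbox.get(timeout=timeout)
    except queue.Empty:
        results[name] = "timed out: no matching send"


# Hypothetical setup: two workers, each with its own inbox.
inbox_a, inbox_b = queue.Queue(), queue.Queue()
results = {}

threads = [
    threading.Thread(target=worker, args=(inbox_a, results, "a", 0.5)),
    threading.Thread(target=worker, args=(inbox_b, results, "b", 0.5)),
]
for t in threads:
    t.start()

# Worker "a" has a matching send.
inbox_a.put("gradients")
# Nothing is ever sent to worker "b"; without the timeout it would wait forever.

for t in threads:
    t.join()

print(results)
```

In the real program the equivalent check is making sure every blocking receive (or collective call) on one rank has a matching call on the other ranks.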

Archie_Nidhi · July 19, 2020, 3:00am

Yeah, I resolved those issues by converting batch_size to an integer and changing loss.data[0] to loss.data.item().
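A minimal sketch of those two fixes, assuming PyTorch is installed. The global batch size of 128, the world size of 2, and the stand-in loss are assumptions for illustration, not taken from the actual script.

```python
import torch
import torch.nn.functional as F

# Hypothetical values: a global batch of 128 split across 2 workers.
GLOBAL_BATCH_SIZE = 128
world_size = 2

# True division yields a float (64.0), which DataLoader rejects.
bsz_float = GLOBAL_BATCH_SIZE / float(world_size)
# Floor division keeps the per-worker batch size an int.
bsz = GLOBAL_BATCH_SIZE // world_size

# Stand-in loss: nll_loss returns a 0-dim tensor by default.
# Passing dim explicitly also avoids the log_softmax deprecation warning.
log_probs = F.log_softmax(torch.zeros(4, 10), dim=1)
targets = torch.tensor([1, 0, 3, 2])
loss = F.nll_loss(log_probs, targets)

epoch_loss = 0.0
# Indexing a 0-dim tensor with [0] raises IndexError;
# .item() returns the value as a plain Python float.
epoch_loss += loss.item()

print(bsz, epoch_loss)
```

Here loss.item() and loss.data.item() give the same number; the shorter form is the one the error message suggests.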