PyTorch Distributed Data Parallel Process 0 terminated with SIGKILL

Hello,
I am relatively new to PyTorch DistributedDataParallel, and I have access to GPU nodes with InfiniBand, so I think I can use the NCCL backend. I am using Slurm scripts to submit my jobs on these resources. The following is an example of a Slurm script that I am using to submit a job. Note that I am using OpenMPI to launch multiple instances of my container (a Singularity image) on the different nodes in the job. The container that I am using for this job is linked here.


SLURM FILE
#!/bin/sh
#SBATCH --ntasks-per-node=2
#SBATCH --time=168:00:00
#SBATCH --partition=gpu
#SBATCH --mem=80gb
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb
#SBATCH --job-name=binary_classification
#SBATCH --output=binary_classification.out

pwd; hostname; date
env | grep SLURM | sort

ulimit -s unlimited
ulimit -c unlimited

export PYTHONPATH=$WORK/tf-gpu-pkgs

module purge
module load singularity compiler/gcc/4.8 openmpi
module list

mpirun singularity exec $WORK/pyopencv.sif python3 -u $@ \
    --multiprocessing_distributed --dist_backend='nccl' --rank=0 \
    --use_adam=1 --benchmarks=1 --benchmark_arch='vgg19' \
    --batch_size=128 --test=1 --transfer=0 --dataset='binary_dataset'
cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
mem_report

When I run the job, I get the following error, and I am not sure what exactly is causing it. My implementation is similar to the ImageNet example for PyTorch distributed training. I have not been able to get this working for weeks and would really appreciate any help, since I don't have much experience with distributed systems.
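
For reference, the spawn call in my main() follows the ImageNet example's pattern, roughly like this simplified sketch (main_worker and the argparse namespace here stand in for my actual code):

import argparse
import torch
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # per-process setup would go here: init_process_group, model, DDP,
    # and the training loop (simplified out of this sketch)
    print(f"worker {gpu}/{ngpus_per_node} started")

def main():
    args = argparse.Namespace()  # stands in for the real parsed arguments
    ngpus_per_node = max(torch.cuda.device_count(), 1)
    # mp.spawn forks one child process per GPU and blocks in join(); if a
    # child is killed externally (SIGKILL usually means the kernel OOM
    # killer or a cgroup memory limit), join() raises the exception below.
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

if __name__ == "__main__":
    main()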

ERROR RECEIVED

Traceback (most recent call last):
  File "distributed_main.py", line 391, in <module>
    main()
  File "distributed_main.py", line 138, in main
    args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53625,1],0]
  Exit code:    1

Thank you,
Ayush

Have you tried other backends (Gloo, MPI)? Do they fail with the same error?

How do you initialize the process group and construct DistributedDataParallel?
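
For reference, the usual pattern in a multi-node, multi-GPU setup looks roughly like this (a sketch, not your code; build_model and the args fields are placeholders):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main_worker(gpu, ngpus_per_node, args):
    # The global rank must be unique for every process across all nodes.
    # In the ImageNet example, --rank is the *node* rank, so passing
    # --rank=0 on every node would make the global ranks collide.
    global_rank = args.rank * ngpus_per_node + gpu
    dist.init_process_group(
        backend="nccl",
        init_method=args.dist_url,   # e.g. "tcp://<master-host>:23456"
        world_size=args.world_size,  # total number of processes
        rank=global_rank,
    )
    torch.cuda.set_device(gpu)
    model = build_model(args).cuda(gpu)  # build_model is a placeholder
    model = DDP(model, device_ids=[gpu])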

For debugging, we would first try a minimal DDP example, such as the sketch below, and make sure it works correctly in the given environment, and then switch to more complex models.
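
Something along these lines (a minimal sketch; it uses the Gloo backend on CPU so it runs on a single machine without NCCL or InfiniBand):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(nn.Linear(10, 1))  # CPU model, no device_ids needed
    opt = optim.SGD(model.parameters(), lr=0.01)
    for _ in range(5):
        opt.zero_grad()
        loss = model(torch.randn(20, 10)).sum()
        loss.backward()  # gradients are all-reduced across ranks here
        opt.step()
    dist.destroy_process_group()
    print(f"rank {rank} finished")

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If that runs cleanly, the DDP mechanics and environment are fine, and the problem is more likely in the NCCL setup, the rank assignment, or resource limits; a SIGKILL often points at the job's memory limit, which the cgget line in your script can help confirm.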