PyTorch Distributed Data Parallel Process 0 terminated with SIGKILL

ayushm-agrawal · May 17, 2020, 6:31pm

Hello,
I am relatively new to PyTorch Distributed Parallel and I have access to GPU nodes with Infiniband so I think I can use the NCCL Backend. I am using Slurm scripts to submit my jobs on these resources. The following is an example of a SLURM script that I am using to submit a job. NOTE HERE that I am using OpenMPI to launch multiple instances of my docker container on the different nodes in the job. The docker container that I am using for this job is linked here.

github.com

LordVoldemort28/docker-deep-machine-learning/blob/master/Dockerfile

FROM unlhcc/cuda-ubuntu:9.2
MAINTAINER Rahul Prajapati <rahul.prajapati90904@gmail.com>

#github https://github.com/LordVoldemort28/docker-deep-machine-learning

#Credit - waleedka/modern-deep-learning
#https://hub.docker.com/r/waleedka/modern-deep-learning/ 

# Supress warnings about missing front-end. As recommended at:
# http://stackoverflow.com/questions/22466255/is-it-possibe-to-answer-dialog-questions-when-installing-under-docker
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \ 
    apt-utils \ 
    git \ 
    curl \ 
    vim \ 
    unzip \ 
    wget \
    build-essential cmake \

This file has been truncated. show original

SLURM FILE

#!/bin/sh
#SBATCH --ntasks-per-node=2
#SBATCH --time=168:00:00
#SBATCH --partition=gpu
#SBATCH --mem=80gb
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb
#SBATCH --job-name=binary_classification
#SBATCH --output=binary_classification.out

pwd; hostname; date
env | grep SLURM | sort

ulimit -s unlimited
ulimit -c unlimited

export PYTHONPATH=$WORK/tf-gpu-pkgs

module purge
module load singularity compiler/gcc/4.8 openmpi
module list

mpirun singularity exec $WORK/pyopencv.sif python3 -u $@ --multiprocessing_distributed --dist_backend='nccl' --rank=0 --use_adam=1 --benchmarks=1 --benchmark_arch='vgg19' --batch_size=128 --test=1 --transfer=0 --dataset='binary_dataset'
cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
mem_report

When I run the job, I get the following error and I am not sure what is exactly causing this issue. My implementation is similar to the ImageNet example on PyTorch distributed. I have not been able to make this work for weeks now and would really appreciate any help on this since I don’t have much experience with distributed systems.

ERROR RECEIVED

Traceback (most recent call last):
  File "distributed_main.py", line 391, in <module>
    main()
  File "distributed_main.py", line 138, in main
    args=(ngpus_per_node, args))
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53625,1],0]
  Exit code:    1

Thank you,
Ayush

mrshenli · May 17, 2020, 6:57pm

Have you tried other backend types (Gloo, MPI), do they fail with the same error?

How do you initialize process group and construct DistributedDataParallel?

For debugging, we would first try a minimum DDP example like this one and make sure it works correctly with the given environment. And then switch to more complex models.

Brando_Miranda · February 18, 2021, 9:36pm

trying mpi is probably a really bad idea see my error:

$ python playground/multiprocessing_playground/ddp_basic_example.py
starting __main__

running main()
current process: <_MainProcess name='MainProcess' parent=None started>
pid: 4060
world_size=1
/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:404: UserWarning: For MPI backend, world_size (1) and rank (0) are ignored since they are assigned by the MPI runtime.
  warnings.warn(
Traceback (most recent call last):
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 153, in <module>
    main()
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 148, in main
    mp.spawn(run_parallel_training_loop, args=(world_size,), nprocs=world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/playground/multiprocessing_playground/ddp_basic_example.py", line 107, in run_parallel_training_loop
    setup_process(rank, world_size)
  File "/home/miranda9/ML4Coq/playground/multiprocessing_playground/ddp_basic_example.py", line 58, in setup_process
    dist.init_process_group(backend, rank=rank, world_size=world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 409, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 482, in _new_process_group_helper
    raise RuntimeError(
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.

I also tried both gloo and nccl both with different error e.g.

$ python playground/multiprocessing_playground/ddp_basic_example.py
starting __main__

running main()
current process: <_MainProcess name='MainProcess' parent=None started>
pid: 4175
world_size=1

Start running DDP with model parallel example on rank: 0.
current process: <SpawnProcess name='SpawnProcess-1' parent=4175 started>
pid: 4198

End running DDP with model parallel example on rank: 0.
End current process: <SpawnProcess name='SpawnProcess-1' parent=4175 started>
End pid: 4198
*** Error in `/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python': free(): invalid size: 0x000055b4ffbb8800 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ac3ca566299]
/usr/local/cuda/lib64/libcublasLt.so.11(free_gemm_select+0x4d)[0x2ac3ee03de7d]
/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/lib/../../../../libcublas.so.11(cublasDestroy_v2+0x165)[0x2ac3e7929af5]
/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc13a3d)[0x2ac3ff678a3d]
/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc13b41)[0x2ac3ff678b41]
/lib64/libc.so.6(+0x39ce9)[0x2ac3ca51ece9]
/lib64/libc.so.6(+0x39d37)[0x2ac3ca51ed37]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(+0x25fe29)[0x55b4ab5b4e29]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(+0x25fe5d)[0x55b4ab5b4e5d]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(+0x25feb4)[0x55b4ab5b4eb4]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(PyRun_SimpleStringFlags+0x66)[0x55b4ab5b7d66]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(Py_RunMain+0x165)[0x55b4ab5b7ed5]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(Py_BytesMain+0x39)[0x55b4ab5b82d9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac3ca507555]
/home/miranda9/miniconda3/envs/automl-meta-learning/bin/python(+0x203493)[0x55b4ab558493]
======= Memory map: ========
200000000-200200000 ---p 00000000 00:00 0 
200200000-200400000 rw-s 00000000 00:05 2897                             /dev/nvidiactl
200400000-202400000 rw-s 00000000 00:05 2897                             /dev/nvidiactl
202400000-205400000 rw-s 00000000 00:05 2897                             /dev/nvidiactl
...
0000 r--p 00534000 00:31 9522865797                 /home/miranda9/miniconda3/envs/automl-meta-learning/lib/libmkl_rt.so
2ac46c6e0000-2ac46c6e2000 rw-p 0053a000 00:31 9522865797                 /home/miranda9/miniconda3/envs/automl-meta-learning/lib/libmkl_rt.so
2ac46c6e2000-2ac46c6f6000 rw-p 00000000 00:00 0 
2ac46c6f6000-2ac46c6fc000 r--p 00000000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c6fc000-2ac46c712000 r-xp 00006000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c712000-2ac46c715000 r--p 0001c000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c715000-2ac46c716000 ---p 0001f000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c716000-2ac46c717000 r--p 0001f000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c717000-2ac46c71a000 rw-p 00020000 00:31 17557268491                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/mkl/_py_mkl_service.cpython-38-x86_64-linux-gnu.so
2ac46c71a000-2ac46c71b000 rw-p 00000000 00:00 0 
2ac46c71b000-2ac46c745000 r--p 00000000 00:31 25797916                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
2ac46c745000-2ac46c9b7000 r-xp 0002a000 00:31 25797916                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
2ac46c9b7000-2ac46ca3f000 r--p 0029c000 00:31 25797916                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
2ac46ca3f000-2ac46ca42000 r--p 00323000 00:31 25797916                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
2ac46ca42000-2ac46ca5e000 rw-p 00326000 00:31 25797916                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so
2ac46ca5e000-2ac46cabf000 rw-p 00000000 00:00 0 
2ac46cabf000-2ac46cac4000 r--p 00000000 00:31 30757369024                /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/lib-dynload/_datetime.cpython-38-x86_64-linux-gnu.so
...
0000 00:05 2897                       /dev/nvidiactl
2ac46cc64000-2ac46cc65000 rw-s 00000000 00:05 2897                       /dev/nvidiactl
2ac46cc65000-2ac46cc66000 rw-s 00000000 00:05 2897                       /dev/nvidiactl
2ac46cc6f000-2ac46cc78000 r--p 00000000 00:31 25797914                   /home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/numpy/core/_multiarray_tests.cpython-38-x86_64-linux-gnu.so
2ac46cc78000-2ac46cc8b000 r-xp 00009000 00:31 25797914                   /home/miranda9/miniconda3/envs/automl-meta-learninTraceback (most recent call last):
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 153, in <module>
    main()
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 148, in main
    mp.spawn(run_parallel_training_loop, args=(world_size,), nprocs=world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 0 terminated with signal SIGABRT

Brando_Miranda · February 18, 2021, 9:37pm

this is my minimum self contained example that runs out of the box:

"""
Based on: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Correctness of code: https://stackoverflow.com/questions/66226135/how-to-parallelize-a-training-loop-ever-samples-of-a-batch-when-cpu-is-only-avai

Note: as opposed to the multiprocessing (torch.multiprocessing) package, processes can use
different communication backends and are not restricted to being executed on the same machine.
"""
import time

from typing import Tuple

import torch
from torch import nn, optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

import os

num_epochs = 5
batch_size = 8
Din, Dout = 10, 5
data_x = torch.randn(batch_size, Din)
data_y = torch.randn(batch_size, Dout)
data = [(i*data_x, i*data_y) for i in range(num_epochs)]

class PerDeviceModel(nn.Module):
    """
    Toy example for a model ran in parallel but not distributed accross gpus
    (only processes with their own gpu or hardware)
    """
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(Din, Din)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(Din, Dout)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def setup_process(rank, world_size, backend='nccl'):
    """
    Initialize the distributed environment (for each process).

    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for process to communicate/coordinate with each other/master. It's a backend library.
    """
    # set up the master's ip address so this child process can coordinate
    # os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
    # if torch.cuda.is_available():
    #     backend = 'nccl'
    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)

def cleanup():
    """ Destroy a given process group, and deinitialize the distributed package """
    dist.destroy_process_group()

def get_batch(batch: Tuple[torch.Tensor, torch.Tensor], rank):
    x, y = batch
    if torch.cuda.is_available():
        x, y = x.to(rank), y.to(rank)
    else:
        x, y = x.share_memory_(), y.share_memory_()
    return x, y

def get_ddp_model(model: nn.Module, rank):
    """
    Moves the underlying storage to shared memory.

        This is a no-op if the underlying storage is already in shared memory
        and for CUDA tensors. Tensors in shared memory cannot be resized.

    :return:

    TODO: does this have to be done outside or inside the process? my guess is that it doesn't matter because
    1) if its on gpu once it's on the right proc it moves it to cpu with id rank via mdl.to(rank)
    2) if it's on cpu then mdl.share_memory() or data.share_memory() is a no op if it's already in shared memory o.w.
    """
    # if gpu avail do the standard of creating a model and moving the model to the GPU with id rank
    if torch.cuda.is_available():
    # create model and move it to GPU with id rank
        model = model.to(rank)
        ddp_model = DDP(model, device_ids=[rank])
    else:
    # if we want multiple cpu just make sure the model is shared properly accross the cpus with shared_memory()
    # note that op is a no op if it's already in shared_memory
        model = model.share_memory()
        ddp_model = DDP(model)  # I think removing the devices ids should be fine...?
    return ddp_model
    # return OneDeviceModel().to(rank) if torch.cuda.is_available() else OneDeviceModel().share_memory()

def run_parallel_training_loop(rank, world_size):
    """
    Distributed function to be implemented later.

    This is the function that is actually ran in each distributed process.

    Note: as DDP broadcasts model states from rank 0 process to all other processes in the DDP constructor,
    you don’t need to worry about different DDP processes start from different model parameter initial values.
    """
    setup_process(rank, world_size)
    print()
    print(f"Start running DDP with model parallel example on rank: {rank}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')

    # get ddp model
    model = PerDeviceModel()
    ddp_model = get_ddp_model(model, rank)

    # do training
    for batch_idx, batch in enumerate(data):
        x, y = get_batch(batch, rank)
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        optimizer.zero_grad()
        outputs = ddp_model(x)
        # Gradient synchronization communications take place during the backward pass and overlap with the backward computation.
        loss_fn(outputs, y).backward()  # When the backward() returns, param.grad already contains the synchronized gradient tensor.
        optimizer.step()  # TODO how does the optimizer know to do the gradient step only once?

    print()
    print(f"End running DDP with model parallel example on rank: {rank}.")
    print(f'End current process: {mp.current_process()}')
    print(f'End pid: {os.getpid()}')
    # Destroy a given process group, and deinitialize the distributed package
    cleanup()

def main():
    print()
    print('running main()')
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    # args
    if torch.cuda.is_available():
        world_size = torch.cuda.device_count()
    else:
        world_size = mp.cpu_count()
    world_size = 1
    print(f'world_size={world_size}')
    mp.spawn(run_parallel_training_loop, args=(world_size,), nprocs=world_size)

if __name__ == "__main__":
    print('starting __main__')
    start = time.time()
    main()
    print(f'execution length = {time.time() - start}')
    print('Done!\a\n')

Yanli_Zhao · February 19, 2021, 8:40pm

looks like your DDP model is trained as expected, crashed when it exits. Do you want to pass “join=True” to mp.spawn()?