Torch DDP with accelerate using torchrun fails with exitcode -11 (SIGSEGV)

Hi, I am trying to use accelerate with torchrun. Internally, accelerate calls torch.nn.parallel.DistributedDataParallel, and this call crashes with either 1 GPU or multiple GPUs. Here is a simple code example:

## ./debug.py
import os

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

from diffusers import UNet2DConditionModel

import torch

def main():
    output_dir = './debug'
    accelerator_project_config = ProjectConfiguration(
        total_limit=10,
        automatic_checkpoint_naming=True,
        project_dir=output_dir,
        logging_dir=os.path.join(output_dir, 'logs'),
    )

    accelerator = Accelerator(
        gradient_accumulation_steps=2,
        mixed_precision='fp16',
        project_config=accelerator_project_config,
    )

    if accelerator.is_main_process:
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(os.path.join(output_dir, 'checkpoints'), exist_ok=True)

    unet = UNet2DConditionModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        subfolder="unet",
        revision=None,
    )

    # Wrapping the model in DistributedDataParallel is what triggers the crash.
    for d_idx in range(accelerator.state.num_processes):
        unet_idx = torch.nn.parallel.DistributedDataParallel(
            unet.cuda(), device_ids=[d_idx], output_device=d_idx,
        )
    unet = accelerator.prepare(unet)

if __name__ == '__main__':
    main()

When I run this code with the following script:

export TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 ./debug.py

the output will be something like the following:

[I ProcessGroupNCCL.cpp:842] [Rank 0] NCCL watchdog thread started!
**hostname**:70473:70473 [0] NCCL INFO Bootstrap : Using eth0:10.200.155.199<0>
**hostname**:70473:70473 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
**hostname**:70473:70473 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).

**hostname**:70473:70473 [0] misc/cudawrap.cc:112 NCCL WARN cuDriverGetVersion failed with 34
NCCL version 2.14.3+cuda11.7
**hostname**:70473:70611 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
**hostname**:70473:70611 [0] NCCL INFO NET/OFI Selected Provider is tcp (found 1 nics)
**hostname**:70473:70611 [0] NCCL INFO Using network AWS Libfabric
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 70473) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
./debug.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-24_20:02:11
  host      : **hostname**
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 70473)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 70473
=======================================================

Using 8 GPUs with --nproc_per_node=8 gives the same error.

The code above is a minimal extraction of what the accelerate library does internally.
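To narrow things down further, a stripped-down DDP-only script along these lines (no accelerate or diffusers, just a toy Linear model for illustration) can be launched with the same torchrun command to see whether plain DistributedDataParallel already segfaults on this machine. This is only a sketch:

## ./ddp_sanity.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # tiny stand-in model instead of the UNet
    model = torch.nn.Linear(16, 16).cuda()
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank,
    )

    # one forward/backward pass to exercise the NCCL collectives
    out = ddp_model(torch.randn(4, 16, device="cuda"))
    out.sum().backward()
    print(f"rank {dist.get_rank()}: ok")

    dist.destroy_process_group()

if __name__ == '__main__':
    main()

If this already crashes, the problem is presumably in the torch/NCCL/driver stack rather than in accelerate or diffusers.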

The environment I am using is as follows:

accelerate==0.20.3
torch==2.0.1
diffusers==0.20.0

with CUDA 11.7 (cuda_11.7.r11.7/compiler.31442593_0) and A100 GPUs.

Does anyone know the reason and solution for this? Thanks!

Your system seems to have trouble communicating with the NVIDIA driver. Was this system working before?
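
As a quick first check, running `nvidia-smi` in the shell plus something like the small snippet below should tell you whether PyTorch itself can see the driver (just a sketch, not specific to your setup):

## ./driver_check.py
import torch

# basic visibility checks between PyTorch and the NVIDIA driver
print("torch version   :", torch.__version__)
print("built with CUDA :", torch.version.cuda)
print("cuda available  :", torch.cuda.is_available())
print("device count    :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0        :", torch.cuda.get_device_name(0))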

Thanks. It was working fine before, but I can double-check. How can I verify that the system is communicating properly with the NVIDIA driver, e.g. with some nvidia-* commands?