How to fix a Signal 11 (SIGSEGV) problem when using DDP?

:bug: Describe the bug

I can run this code successfully with the default gloo backend, but it fails when I switch to the nccl backend.
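
For reference, a minimal DDP setup of this shape looks like the sketch below; the model, script layout, and torchrun-style launch are illustrative assumptions, not the actual code from this post.

```python
# Minimal sketch of the kind of setup being described (placeholder model).
# Assumes a torchrun launch, so RANK/WORLD_SIZE/LOCAL_RANK are set in the env.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Works with backend="gloo"; crashes with SIGSEGV when backend="nccl".
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(10, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    x = torch.randn(8, 10, device=local_rank)
    ddp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 train.py` (script name assumed here).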

Note that my local CUDA version is newer than the one PyTorch was built with; however, I expect it to work without problems, as per Install pytorch with Cuda 12.1.

I have tried Python 3.8, but the problem persists.

Versions

Below is my environment:
PyTorch version: 2.0.0+cu117
CUDA used to build PyTorch: 11.7

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Libc version: glibc-2.35

Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

You are not using the locally installed CUDA toolkit as the PyTorch pip wheels ship with their own CUDA runtime and dependencies. Only the NVIDIA driver will be used.

To debug the issue, you could try to get a stacktrace by launching the application from gdb.

Ok, I’ll try it. Though I suspect the problem may come from Python 3.9; I have seen others downgrade to Python 3.8 to fix this problem. See How to fix a SIGSEGV in pytorch when using distributed training (e.g. DDP)? - #7 by Brando_Miranda and How to fix SIGSEGV in distributed training (i.e. DDP) - #4 by dsethz.

I have run the same code on Python 3.8, and it fails as well.

I cannot use gdb, since my program runs on a GPU cluster.
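
One alternative that does not need gdb (a suggestion beyond what is mentioned in this thread) is Python's standard-library faulthandler, which prints each thread's Python traceback to stderr when the process receives SIGSEGV:

```python
# Stand-in for gdb on a managed cluster: the standard-library faulthandler
# dumps every thread's Python traceback on SIGSEGV (and SIGFPE/SIGABRT/etc.).
import faulthandler
import sys

# Enable at the very top of the training script; the tracebacks go to
# stderr, so they end up in the job log.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

This only shows the Python-level frames, not the native NCCL stack, but it is often enough to locate the failing call.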

How about trying gloo instead of nccl for your communication backend? I fixed the SIGSEGV error that way.
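
The change is typically just the backend argument, e.g.:

```python
import torch.distributed as dist

# Workaround: use the gloo backend instead of nccl.
dist.init_process_group(backend="gloo")
```

Keep in mind that gloo is generally slower than nccl for GPU collectives, so this is a workaround rather than a fix.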