Got signal SIGSEGV error when I use NCCL backend for DDP

LuoXin-s · April 28, 2023, 11:23am

Describe the bug

I can run this code successfully with the default gloo backend, but it fails when I switch to the nccl backend.

Note that my installed CUDA version is higher than the one PyTorch was built with; however, I expected it to work without problems, as per "Install pytorch with Cuda 12.1".

I have also tried Python 3.8, but the problem persists.

Versions

Below is my environment:
PyTorch version: 2.0.0+cu117
CUDA used to build PyTorch: 11.7

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Libc version: glibc-2.35

Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

LuoXin-s · April 28, 2023, 1:47pm

Solved by setting NCCL_NET=Socket, as suggested by my cluster manager. I do not know why this works.
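For anyone hitting the same crash, here is a minimal sketch of applying that setting from inside the training script. It assumes the variable must be set before the NCCL process group is created, and that the script is launched with torchrun so that RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already defined. The helper names `configure_nccl_env` and `init_distributed` are made up for illustration, and setting NCCL_DEBUG=INFO is an extra that is not part of the original fix; it only makes NCCL log which network it selects.

```python
import os


def configure_nccl_env(net="Socket", debug="INFO", env=None):
    """Set NCCL environment variables (hypothetical helper).

    NCCL_NET asks NCCL to use a specific network implementation
    ("Socket" here, as in the fix above). NCCL_DEBUG is only set if
    the user has not already chosen a value.
    """
    env = os.environ if env is None else env
    env["NCCL_NET"] = net
    env.setdefault("NCCL_DEBUG", debug)
    return env


def init_distributed(backend="nccl"):
    """Create the process group after the NCCL variables are in place.

    Assumes a torchrun-style launch that provides the rendezvous
    variables read by the default env:// init method.
    """
    import torch.distributed as dist  # imported lazily; needs PyTorch

    configure_nccl_env()
    dist.init_process_group(backend=backend)
```

Exporting the variable in the shell or job script before launching (NCCL_NET=Socket) should have the same effect. Forcing the socket transport bypasses other network plugins such as InfiniBand, so it may cost bandwidth on multi-node jobs.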

ptrblck · April 28, 2023, 6:59pm

Double post from here.