Why does my code raise ProcessExitedException: process 0 terminated with signal SIGSEGV?

The following minimal example fails with torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV, and I’m not able to figure out why:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import os
import datetime

def worker(rank, task_queue):
    print("rank", rank)
    dist.init_process_group("nccl", rank=rank, world_size=2, timeout=datetime.timedelta(seconds=10))
    torch.cuda.set_device(rank)
    tensor = torch.randn(10).cuda()
    dist.all_reduce(tensor)
    while True:
        task = task_queue.get()  # Blocking call until a new task is received
        if task == "stop":
            break
        else:
            # Process the task
            print("Received task:", task)
            # Add your task processing logic here
    # torch.cuda.synchronize(device=rank)
    
def start_worker(rank, task_queues):
    worker(rank, task_queues[rank])

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    os.environ["TORCH_CPP_LOG_LEVEL"]="INFO"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    task_queues = [mp.Queue() for _ in range(2)]  # one queue per worker (nprocs=2)
    mp.spawn(start_worker, nprocs=2, args=(task_queues, ))
    task_queues[0].put("Data for worker 0")

Here is the output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0  On |                  Off |
|  0%   51C    P8              27W / 450W |    156MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:42:00.0 Off |                  Off |
|  0%   48C    P8              20W / 450W |     13MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2128      G   /usr/bin/gnome-shell                        149MiB |
|    1   N/A  N/A      2128      G   /usr/bin/gnome-shell                          6MiB |
+---------------------------------------------------------------------------------------+

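And here is the debug output that gets printed (c10d logs from TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL; NCCL_DEBUG was not set):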

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.

rank 1

[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 29501).

[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.

[I socket.cpp:830] [c10d - trace] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.

rank 0

[I socket.cpp:452] [c10d - debug] The server socket will attempt to listen on an IPv6 address.

[I socket.cpp:502] [c10d - debug] The server socket is attempting to listen on [::]:29501.

[I socket.cpp:576] [c10d] The server socket has started to listen on [::]:29501.

[I TCPStore.cpp:252] [c10d - debug] The server has started on port = 29501.

[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 29501).

[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.

[I socket.cpp:297] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:38040.

[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:38040.

[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host localhost:29501

[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.

[I socket.cpp:297] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:38054.

[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:38054.

[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host localhost:29501

[I ProcessGroupNCCL.cpp:687] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 10000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: OFF, ID=94044443051152

[I ProcessGroupNCCL.cpp:687] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 10000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: OFF, ID=94319499085072

[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))

[I ProcessGroupNCCL.cpp:1342] NCCL_DEBUG: N/A

Check whether disabling P2P in NCCL helps, for example by setting the environment variable NCCL_P2P_DISABLE=1 before the workers are spawned.
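
To try this, something like the sketch below might work. It sets the NCCL environment variables in the parent process, next to the MASTER_ADDR and MASTER_PORT assignments in the question, so the spawned workers inherit them. The helper name configure_nccl_env is made up for illustration; NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables. Whether this actually removes the SIGSEGV on this machine is an assumption that has to be tested.

```python
import os


def configure_nccl_env(disable_p2p: bool = True, verbose: bool = True) -> dict:
    """Hypothetical helper: set NCCL env vars in this process.

    Child processes started afterwards inherit os.environ, so this
    should be called before mp.spawn(...).
    """
    settings = {}
    if disable_p2p:
        # Ask NCCL not to use direct GPU-to-GPU (peer-to-peer) transfers.
        settings["NCCL_P2P_DISABLE"] = "1"
    if verbose:
        # Make NCCL log its own initialization and transport choices.
        settings["NCCL_DEBUG"] = "INFO"
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    applied = configure_nccl_env()
    print("NCCL settings applied:", applied)
    # ... then call mp.spawn(start_worker, nprocs=2, args=(task_queues,))
    # exactly as in the original script.
```

If the crash disappears with P2P disabled, that points at the peer-to-peer path between the two GPUs rather than at the Python code. With NCCL_DEBUG=INFO set, the NCCL log should also show which transport each rank ends up using.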