The following minimal example fails with torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV, and I'm not able to figure out why:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import os
import datetime

def worker(rank, task_queue):
    print("rank", rank)
    dist.init_process_group("nccl", rank=rank, world_size=2, timeout=datetime.timedelta(seconds=10))
    torch.cuda.set_device(rank)
    tensor = torch.randn(10).cuda()
    dist.all_reduce(tensor)
    while True:
        task = task_queue.get()  # Blocking call until a new task is received
        if task == "stop":
            break
        else:
            # Process the task
            print("Received task:", task)
            # Add your task processing logic here
            # torch.cuda.synchronize(device=rank)

def start_worker(rank, task_queues):
    worker(rank, task_queues[rank])

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    task_queues = [mp.Queue() for i in range(1)]
    mp.spawn(start_worker, nprocs=2, args=(task_queues,))
    task_queues[0].put("Data for worker 0")
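In case it is relevant, here is a small sanity check I can run on the same machine to confirm both GPUs are visible and whether they have peer access (this is just a debugging sketch, not part of the repro):

import torch

# Diagnostic only: list the visible GPUs and check whether they can
# access each other's memory directly (peer-to-peer).
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))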
Here is the output of nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 On | Off |
| 0% 51C P8 27W / 450W | 156MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:42:00.0 Off | Off |
| 0% 48C P8 20W / 450W | 13MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2128 G /usr/bin/gnome-shell 149MiB |
| 1 N/A N/A 2128 G /usr/bin/gnome-shell 6MiB |
+---------------------------------------------------------------------------------------+
And here is the distributed (c10d) debug output that gets printed with TORCH_DISTRIBUTED_DEBUG=DETAIL:
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
rank 1
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 29501).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:830] [c10d - trace] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
rank 0
[I socket.cpp:452] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:502] [c10d - debug] The server socket is attempting to listen on [::]:29501.
[I socket.cpp:576] [c10d] The server socket has started to listen on [::]:29501.
[I TCPStore.cpp:252] [c10d - debug] The server has started on port = 29501.
[I socket.cpp:686] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 29501).
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:297] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:38040.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:38040.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host localhost:29501
[I socket.cpp:761] [c10d - trace] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:297] [c10d - debug] The server socket on [::]:29501 has accepted a connection from [localhost]:38054.
[I socket.cpp:849] [c10d] The client socket has connected to [localhost]:29501 on [localhost]:38054.
[I TCPStore.cpp:261] [c10d - debug] TCP client connected to host localhost:29501
[I ProcessGroupNCCL.cpp:687] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 10000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: OFF, ID=94044443051152
[I ProcessGroupNCCL.cpp:687] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 1, NCCL_ENABLE_TIMING: 1, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 10000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, NCCL_DEBUG: OFF, ID=94319499085072
[I ProcessGroupWrapper.cpp:562] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupWrapper.cpp:562] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[I ProcessGroupNCCL.cpp:1342] NCCL_DEBUG: N/A
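For completeness, this is the snippet I use to print version information for my setup (the actual values from my machine are not reproduced here):

import torch

# Versions of PyTorch, the CUDA toolkit it was built against, and the
# NCCL library bundled with PyTorch (printed for reference).
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())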