How do you fix a SIGSEGV (segmentation fault) in PyTorch distributed training (e.g. with DDP)?

Replacing mp.spawn with explicitly created processes (mp.Process plus start and join) can work around this problem. The segfault appears to be triggered by how mp.spawn sets up its worker processes and shared memory in some environments; launching and joining the workers manually sidesteps that code path and avoids the crash.

import torch.multiprocessing as mp

# mp.spawn(run, args=(world_size, q), nprocs=world_size, join=True)
children = []
for rank in range(world_size):
    # mp.spawn passes the rank as the first argument automatically;
    # with mp.Process it has to be passed explicitly.
    subproc = mp.Process(target=run, args=(rank, world_size, q))
    children.append(subproc)
    subproc.start()

for subproc in children:
    subproc.join()

With these modifications in place, the training script should run without the segmentation fault.
