Is mp.spawn spawning 4 processes, causing my `Exception: process 0 terminated with signal SIGSEGV` error, and how do I fix it?

I start 2 processes because I only have 2 GPUs, but it seems to start 4 and then raises `Exception: process 0 terminated with signal SIGSEGV`. Why is that, and how can I stop it? (I am assuming that is the source of my bug, by the way.)

Error msg:

$ python playground/multiprocessing_playground/ddp_basic_example.py
starting __main__

running main()
current process: <_MainProcess name='MainProcess' parent=None started>
pid: 30735
world_size=2


Start running DDP with model parallel example on rank: 0.
Start running DDP with model parallel example on rank: 1.
current process: <SpawnProcess name='SpawnProcess-1' parent=30735 started>
pid: 30753
current process: <SpawnProcess name='SpawnProcess-2' parent=30735 started>
pid: 30754

Start running DDP with model parallel example on rank: 1.
current process: <SpawnProcess name='SpawnProcess-2' parent=30735 started>
pid: 30754

Start running DDP with model parallel example on rank: 0.
current process: <SpawnProcess name='SpawnProcess-1' parent=30735 started>
pid: 30753
Traceback (most recent call last):
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 152, in <module>
    main()
  File "playground/multiprocessing_playground/ddp_basic_example.py", line 147, in main
    mp.spawn(run_parallel_training_loop, args=(world_size,), nprocs=world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 0 terminated with signal SIGSEGV
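For what it's worth, here is a quick sanity-check sketch (not part of my script) that counts the distinct worker PIDs appearing in the console output above:

```python
import re

# The console output pasted above, abridged to the lines that carry PIDs.
log = """\
current process: <_MainProcess name='MainProcess' parent=None started>
pid: 30735
current process: <SpawnProcess name='SpawnProcess-1' parent=30735 started>
pid: 30753
current process: <SpawnProcess name='SpawnProcess-2' parent=30735 started>
pid: 30754
current process: <SpawnProcess name='SpawnProcess-2' parent=30735 started>
pid: 30754
current process: <SpawnProcess name='SpawnProcess-1' parent=30735 started>
pid: 30753
"""

all_pids = re.findall(r"pid: (\d+)", log)   # every "pid: NNN" line in order
main_pid = "30735"                          # the _MainProcess PID from the log
worker_pids = {p for p in all_pids if p != main_pid}
print(len(worker_pids))  # number of distinct spawned worker PIDs  → 2
```

Only two distinct SpawnProcess PIDs (30753 and 30754) show up in the log, even though each rank's startup message is printed twice.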

A completely self-contained example:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def example(rank, world_size):
    # create default process group
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '8888'
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()

def main():
    # world_size = 2
    # note: device_count() makes mp.spawn start one process per visible GPU
    world_size = torch.cuda.device_count()
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__ == "__main__":
    main()
    print('Done\n\a')

crossposted: https://stackoverflow.com/questions/66268131/why-is-mp-spawn-spawning-4-processes-when-i-only-want-2
