DDP: ProcessExitedException: process 0 terminated with exit code 1

On LambdaLabs, I spin up a two-GPU machine and run the simple example code from the PyTorch docs.

However, I can’t even get the simple example to run:

---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-1-e1523c2c83af> in <module>
     34 
     35 if __name__=="__main__":
---> 36     main()

<ipython-input-1-e1523c2c83af> in main()
     28 def main():
     29     world_size = 2
---> 30     mp.spawn(example,
     31         args=(world_size,),
     32         nprocs=world_size,

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    228                ' torch.multiprocessing.start_process(...)' % start_method)
    229         warnings.warn(msg)
--> 230     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    137                 )
    138             else:
--> 139                 raise ProcessExitedException(
    140                     "process %d terminated with exit code %d" %
    141                     (error_index, exitcode),

ProcessExitedException: process 0 terminated with exit code 1

How can I get a simple DDP example to run?

This is the sample code:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()

def main():
    world_size = 2
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()

I will note that it works if I pass join=False. But why does this simple PyTorch docs example not work as written?
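On the join=False observation: as far as I understand torch.multiprocessing, with join=False mp.spawn returns a ProcessContext instead of blocking, so a worker crash only surfaces if the parent joins the context itself. That may be why the run merely appears to succeed. A sketch, reusing example and world_size from the snippet above:

# join=False: spawn returns a ProcessContext immediately instead of blocking
context = mp.spawn(example,
    args=(world_size,),
    nprocs=world_size,
    join=False)

# The worker's exception is only raised here; without this loop the parent
# can exit cleanly even though a worker died.
while not context.join():
    pass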

Hi,

This works OK for me with join=True. It seems like your process 0 is dying for some reason; can you add logging to the example function and see where the problem is?

(It also seems like you aren't using GPUs here.)

@agolynski could you suggest how/where I should add logging?

Yes, I'm on a two-GPU machine from LambdaLabs if you want to try to replicate. (I upgrade torch to the latest release when I create the instance.)

Do I need to add anything extra to use the GPUs?

The code you have doesn't use GPUs; it's CPU-only tensors.
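For reference, a GPU variant would look roughly like this. This is a sketch only, assuming two CUDA devices, using the nccl backend, which is the usual choice for GPU collectives:

def example(rank, world_size):
    # nccl is the standard backend for GPU training
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # bind this process to GPU `rank` and keep everything on that device
    torch.cuda.set_device(rank)
    model = nn.Linear(10, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    outputs = ddp_model(torch.randn(20, 10).cuda(rank))
    labels = torch.randn(20, 10).cuda(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()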

I suggest adding print statements before and after the critical sections of the code, i.e.:

- dist.init_process_group("gloo", rank=rank, world_size=world_size)
- the forward pass
- the backward pass
- optimizer.step()

to see which line causes your error.
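For example, here is a sketch of the posted example function with prints added; flush=True matters because output from spawned processes is often buffered, and the last message printed before the crash tells you which section fails:

def example(rank, world_size):
    print(f"[rank {rank}] before init_process_group", flush=True)
    # the default env:// rendezvous reads MASTER_ADDR / MASTER_PORT
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"[rank {rank}] init_process_group done", flush=True)

    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    print(f"[rank {rank}] before forward pass", flush=True)
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)

    print(f"[rank {rank}] before backward pass", flush=True)
    loss_fn(outputs, labels).backward()

    print(f"[rank {rank}] before optimizer.step()", flush=True)
    optimizer.step()
    print(f"[rank {rank}] done", flush=True)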