I am trying to run the example from the DDP tutorial:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def example(rank, world_size):
    # create default process group
    dist.init_process_group("nccl", rank=rank, init_method=None, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()


def main():
    world_size = 2
    # spawn one process per rank; each process runs example(rank, world_size)
    mp.spawn(example, args=(world_size,), nprocs=world_size, join=True)


if __name__ == '__main__':
    main()
I get the following error:
Exception: process 0 terminated with exit code 1
I am running this in a Jupyter notebook inside a Docker container.
When I run the same code as a script inside the container, but outside Jupyter, it appears to work fine.
Why would it fail in Jupyter?
More generally, what is the recommended way to use DDP in a notebook?
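One possible direction, offered only as an unverified sketch: `mp.spawn` starts fresh Python processes that look up the worker function by its module name, and a function defined in a notebook cell belongs to `__main__`, which a child process cannot re-import. Assuming that is the cause here, moving the worker into an importable file may help. The module name `ddp_worker`, the helpers `write_worker_module` and `launch`, and the placeholder worker body below are all hypothetical, not part of the tutorial.

```python
import importlib
import sys
from pathlib import Path

# Hypothetical stand-in for the `example(rank, world_size)` worker from the
# question; the real DDP setup and training step would replace the body.
WORKER_SOURCE = '''
def example(rank, world_size):
    # init_process_group, model, forward/backward, optimizer.step() go here
    return rank, world_size
'''


def write_worker_module(directory, name="ddp_worker", source=WORKER_SOURCE):
    """Write the worker to <directory>/<name>.py and import it from there."""
    directory = Path(directory).resolve()
    (directory / f"{name}.py").write_text(source)
    # Make the directory importable in this process; spawned children
    # receive the parent's sys.path, so they can import the module too.
    if str(directory) not in sys.path:
        sys.path.insert(0, str(directory))
    importlib.invalidate_caches()
    return importlib.import_module(name)


def launch(worker_fn, world_size=2):
    """Spawn one process per rank, as main() does in the question."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch.multiprocessing as mp
    mp.spawn(worker_fn, args=(world_size,), nprocs=world_size, join=True)
```

In a notebook cell this would be used roughly as `worker = write_worker_module("."); launch(worker.example)`. The rendezvous settings that the default `env://` init method reads (`MASTER_ADDR`, `MASTER_PORT`) would still need to be set, just as for the script version.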