turian
April 11, 2021, 1:57pm
1
On LambdaLabs, I spun up a two-GPU machine and ran the simple example code from the PyTorch docs.
However, I can’t even get that example to run:
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-1-e1523c2c83af> in <module>
     34
     35 if __name__=="__main__":
---> 36     main()

<ipython-input-1-e1523c2c83af> in main()
     28 def main():
     29     world_size = 2
---> 30     mp.spawn(example,
     31         args=(world_size,),
     32         nprocs=world_size,

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    228                ' torch.multiprocessing.start_process(...)' % start_method)
    229         warnings.warn(msg)
--> 230     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190

~/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    137             )
    138         else:
--> 139             raise ProcessExitedException(
    140                 "process %d terminated with exit code %d" %
    141                 (error_index, exitcode),

ProcessExitedException: process 0 terminated with exit code 1
How can I get a simple DDP example to run?
turian
April 11, 2021, 2:01pm
2
This is the sample code:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()

def main():
    world_size = 2
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()
I will note that it works if join=False. But why does this simple PyTorch docs example not work as written?
agolynski
(Alexander Golynski)
April 12, 2021, 6:46pm
3
turian:
mp.spawn(example,
    args=(world_size,),
    nprocs=world_size,
    join=True)
Hi,
This works fine for me with join=True. It seems your process 0 is dying for some reason. Can you add logging to the example function to see where the problem is?
(It also seems like you aren’t using GPUs here.)
turian
April 15, 2021, 1:05am
4
@agolynski could you suggest how/where I should add logging?
Yes, I’m on a 2 GPU machine from LambdaLabs if you want to try and replicate. (I upgrade torch to the latest release when I create the instance.)
Do I need to add anything extra to use the GPUs?
agolynski
(Alexander Golynski)
April 15, 2021, 4:56pm
5
The code you have doesn’t use GPUs; it uses CPU-only tensors.
I suggest adding some print statements to your example function before and after the critical sections of the code, i.e.
dist.init_process_group("gloo", rank=rank, world_size=world_size)
forward pass
backward pass
optimizer.step()
and see which line causes your error.
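A minimal sketch of that kind of logging, using only the standard library. The helper name `logged_step` and its `log` parameter are hypothetical and not part of PyTorch; the output is flushed immediately so that a line is not lost if the worker process exits abruptly.

```python
import contextlib
import functools
import traceback

# print() that flushes right away, so output survives a sudden process exit
_flushed_print = functools.partial(print, flush=True)


@contextlib.contextmanager
def logged_step(rank, name, log=_flushed_print):
    """Log before and after one step; log the traceback if the step raises."""
    log(f"[rank {rank}] starting: {name}")
    try:
        yield
    except Exception:
        # Report which step failed, then re-raise so the process still exits with an error
        log(f"[rank {rank}] FAILED in: {name}\n{traceback.format_exc()}")
        raise
    log(f"[rank {rank}] finished: {name}")
```

Inside `example`, each critical section could then be wrapped, for instance `with logged_step(rank, "init_process_group"):` around the `dist.init_process_group("gloo", rank=rank, world_size=world_size)` call, and likewise for the forward pass, the backward pass, and `optimizer.step()`. A "starting" line with no matching "finished" line points at the step where the process died.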