Hi,
I was running a DDP example from this tutorial using the following command:
```
!torchrun --standalone --nproc_per_node=2 multigpu_torchrun.py 50 3
```
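For context, the code that the traceback points at is roughly the following (paraphrased from memory from the tutorial, so the exact details may differ slightly from my local copy):

```python
import os
import torch
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_setup():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for every spawned process
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

class Trainer:
    def __init__(self, model, train_data, optimizer, save_every, snapshot_path):
        self.gpu_id = int(os.environ["LOCAL_RANK"])
        self.model = model.to(self.gpu_id)
        # line 53 in the traceback below: wrap the model for DDP on this rank's GPU
        self.model = DDP(self.model, device_ids=[self.gpu_id])
```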
When I run it with 2 GPUs everything works fine; however, when I increase the number of GPUs to 3 (i.e. `--nproc_per_node=3`, which is the run shown below), it fails with the error below. Each rank prints the same traceback, so I'm showing it only once:
```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "multigpu_torchrun.py", line 117, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "multigpu_torchrun.py", line 104, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "multigpu_torchrun.py", line 53, in __init__
    self.model = DDP(self.model, device_ids=[self.gpu_id])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 158408) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 11, in <module>
load_entry_point('torch', 'console_scripts', 'torchrun')()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
multigpu_torchrun.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-11-01_20:20:09
host : a1c0377be824
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 158409)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-11-01_20:20:09
host : a1c0377be824
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 158410)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-01_20:20:09
host : a1c0377be824
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 158408)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
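If it helps, I can rerun with more verbose NCCL logging (e.g. by prefixing the command with `NCCL_DEBUG=INFO`) and post that output as well.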
Any help is appreciated!