Torch distributed fails on more than 2 GPUs

Hi,

I was running a DDP example from this tutorial using the following command:

!torchrun --standalone --nproc_per_node=2 multigpu_torchrun.py 50 3
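
For context, the script follows the usual torchrun pattern (roughly sketched below; `Trainer` and `gpu_id` come from the traceback further down, the other names are placeholders, and the real tutorial script also builds the dataset, optimizer and snapshot logic):

```python
import os

import torch
from torch.distributed import destroy_process_group, init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_setup():
    # torchrun exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so init_process_group can pick everything up from the environment.
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


class Trainer:
    def __init__(self, model: torch.nn.Module):
        self.gpu_id = int(os.environ["LOCAL_RANK"])
        self.model = model.to(self.gpu_id)
        # The crash happens here (line 53 in the traceback): DDP's constructor
        # verifies parameter shapes across all ranks, which needs a working
        # NCCL communicator between every pair of processes.
        self.model = DDP(self.model, device_ids=[self.gpu_id])


if __name__ == "__main__":
    ddp_setup()
    trainer = Trainer(torch.nn.Linear(20, 1))
    destroy_process_group()
```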

When I run it with 2 GPUs (--nproc_per_node=2), everything works fine; however, when I increase the number of GPUs to 3 (--nproc_per_node=3, which is what produced the output below), it fails with this error:

```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "multigpu_torchrun.py", line 117, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "multigpu_torchrun.py", line 104, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "multigpu_torchrun.py", line 53, in __init__
    self.model = DDP(self.model, device_ids=[self.gpu_id])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 158408) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 11, in <module>
    load_entry_point('torch', 'console_scripts', 'torchrun')()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
multigpu_torchrun.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-01_20:20:09
  host      : a1c0377be824
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 158409)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-11-01_20:20:09
  host      : a1c0377be824
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 158410)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-01_20:20:09
  host      : a1c0377be824
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 158408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Any help is appreciated!
  • Make sure your PyTorch build and the installed NCCL version are compatible.
  • Set NCCL_DEBUG=INFO to get more details about the specific problem (see the snippet below).
  • Ensure each GPU has enough free memory.
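
For reference, here is one way to check which NCCL build your PyTorch uses and to enable the extra NCCL logging from inside the script (a quick sketch; the environment variables can just as well be exported on the torchrun command line, and NCCL_DEBUG_SUBSYS is optional):

```python
import os

import torch

# Report the PyTorch build and the NCCL version it ships with.
print("torch:", torch.__version__)
print("nccl :", torch.cuda.nccl.version())

# Turn on NCCL's own logging. It has to be in the environment before the
# first NCCL communicator is created, so set it at the very top of the
# script or export it in the shell that launches torchrun, e.g.
#   NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node=3 multigpu_torchrun.py 50 3
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # optional: limit the noise
```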