DistributedDataParallel: CUDA init; forced parameter sync; pinning?

I’m trying to figure out DistributedDataParallel (on a single machine, in single-GPU-per-process mode). I’ve got a few questions:

  1. Will all launched processes do CUDA init? Is it safe? Should we explicitly set CUDA_VISIBLE_DEVICES per launched process to ensure that each one can see only one device? (Think fork/spawn global-state issues with CUDA / OpenMP / pthreads, etc.) Would that prevent it from doing RPC using NCCL?

  2. Is it sensible to do a forced parameter sync once in a while? Inherent non-determinism from GPU parallelism can cause the replicas’ parameters to diverge for some sensitive models (even if gradients are synchronized). How does one do it?

  3. When does it make sense to do NUMA node pinning? CPU affinity pinning?

Thank you!

CUDA is lazily initialized, so if a process does not touch a specific device, no CUDA context should be created on that device.
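
A quick way to observe this (a sketch, assuming a CUDA-enabled build of PyTorch and at least one GPU):

import torch

print(torch.cuda.is_initialized())  # False: importing torch does not create a CUDA context
x = torch.randn(4, 4)               # CPU-only work still does not touch the GPU
print(torch.cuda.is_initialized())  # still False
y = x.cuda()                        # the first CUDA call lazily initializes a context on the current device
print(torch.cuda.is_initialized())  # True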

Is it safe?

Even if multiple processes create contexts on the same device, it won’t crash, but each context consumes about 500MB of CUDA memory, which is not desirable.

Should we explicitly set CUDA_VISIBLE_DEVICES per launched process to ensure that each one can see only one device?

This is the recommended way, as it also rules out the possibility that some third-party library accidentally creates tensors on other devices.
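
A minimal sketch of one way to do this when launching one process per GPU with torch.multiprocessing (run_worker and the launch code are illustrative; the key point is that CUDA_VISIBLE_DEVICES must be set before the process makes its first CUDA call):

import os
import torch
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # Hide every GPU except this rank's one before CUDA is initialized,
    # so the only visible device inside this process is cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    torch.cuda.set_device(0)
    # ... dist.init_process_group(...) and DDP construction would go here ...

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)

Setting the variable in the shell before launching each process (as in the commands further down) works just as well and avoids any ordering concerns inside Python.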

Would that prevent it from doing RPC using NCCL?

Which RPC are you referring to? Are you using DDP in conjunction with torch.distributed.rpc?

Is it sensible to do a forced parameter sync once in a while? Inherent non-determinism from GPU parallelism can cause the replicas’ parameters to diverge for some sensitive models (even if gradients are synchronized).

If there is drift, then yes, manually syncing once in a while would help.

How does one do it?

You can use broadcast; see the code linked below. If you would like to compute an average instead, you can also use all_reduce and then divide by world_size.
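
Since the link isn’t reproduced here, a minimal sketch of both approaches (the function names are illustrative, not a PyTorch API); you would call one of them every N iterations on the underlying model, e.g. ddp_model.module:

import torch.distributed as dist

def force_param_sync(module, src_rank=0):
    # Overwrite every replica's parameters and buffers with rank 0's copy,
    # so all replicas hold bit-identical values again.
    for p in module.parameters():
        dist.broadcast(p.data, src=src_rank)
    for b in module.buffers():
        dist.broadcast(b, src=src_rank)

def average_params(module):
    # Alternative: average the (possibly drifted) parameters across replicas.
    world_size = dist.get_world_size()
    for p in module.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size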

When does it make sense to do NUMA node pinning? CPU affinity pinning?

Hey @ptrblck, do you know what’s the best practice here?
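
In the meantime, one common Linux-only way to do the pinning itself (a sketch; the core IDs are hypothetical and would come from your machine’s topology, e.g. nvidia-smi topo -m or lscpu) is to bind each rank’s process to the cores of the NUMA node local to its GPU, either with numactl when launching or from Python:

import os

# Hypothetical core set: cores 0-15, assumed to sit on the NUMA node
# local to the GPU this rank uses. Processes forked after this point
# (e.g. DataLoader workers) inherit the affinity mask.
local_cores = set(range(0, 16))
os.sched_setaffinity(0, local_cores)  # pid 0 means "the calling process"

This tends to matter most when data loading/preprocessing is CPU-bound or when host-to-device copies would otherwise cross NUMA nodes; whether it helps in a given setup is best confirmed by measurement.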

Thanks a lot @mrshenli for these responses! Looks like this should be in some official guide :slight_smile:

@mrshenli I’m having trouble launching the most basic two-node distributed configuration (I checked, the TCP connection works fine with nc). It doesn’t seem to respect the passed port. If you could take a look, it would be awesome!

UPD: I created an issue to discuss this: https://github.com/pytorch/pytorch/issues/44544

import os
import torch
import argparse
import torch.distributed as dist

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--backend', default='gloo')
    parser.add_argument('--rank', type=int, default=0)
    parser.add_argument('--world-size', type=int, default=1)
    args = parser.parse_args()
    # Rendezvous via MASTER_ADDR / MASTER_PORT taken from the environment.
    dist.init_process_group(args.backend, init_method="env://", rank=args.rank, world_size=args.world_size)
    print(f"Master node {os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}. Rank {args.rank}. World size: {args.world_size}")
    # Sanity check: all-reduce (rank + 1) across ranks; every rank should end up
    # with 1 + 2 + ... + world_size.
    test_tensor = torch.tensor(args.rank + 1)
    if args.backend == 'nccl':
        test_tensor = test_tensor.cuda()
    dist.all_reduce(test_tensor, op=dist.ReduceOp.SUM)
    print(f"Test value: {test_tensor.item()}, expected: {sum(range(args.world_size + 1))}")
Test standalone

Node 1
Input: 
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py
Output: 
Master node 10.81.13.54:12345. Rank 0. World size: 1
Test value: 1, expected: 1

Node 2
Input: 
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.51 MASTER_PORT=12345 python distributed_example.py 
Output: 
Master node 10.81.13.51:12345. Rank 0. World size: 1
Test value: 1, expected: 1

Test distributed Gloo

Node 1
Input: 
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 0 --world-size 2
Output: 
Traceback (most recent call last):
  File "distributed_example.py", line 11, in <module>
    dist.init_process_group('gloo', init_method="env://", rank=args.rank, world_size=args.world_size)
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 425, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 499, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/opt/conda/conda-bld/pytorch_1595629411241/work/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [10.81.13.51]:11169: No route to host

Node 2
Input:
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 1 --world-size 2
Output:

Test distributed NCCL

Node 1
Input:
NCCL_SOCKET_IFNAME=team0 NCCL_DEBUG=INFO MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 0 --world-size 2 --backend nccl
Output:
Master node 10.81.13.54:12345. Rank 0. World size: 2
srs-ds11:64570:64570 [0] NCCL INFO Bootstrap : Using [0]team0:10.81.13.54<0>
srs-ds11:64570:64570 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

srs-ds11:64570:64570 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
srs-ds11:64570:64570 [0] NCCL INFO NET/Socket : Using [0]team0:10.81.13.54<0>
NCCL version 2.4.8+cuda10.1
srs-ds11:64570:65266 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555

Node 2
Input:
NCCL_SOCKET_IFNAME=team0 NCCL_DEBUG=INFO MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 1 --world-size 2 --backend nccl
Output:
Master node 10.81.13.54:12345. Rank 1. World size: 2
srs-ds8:192240:192240 [0] NCCL INFO Bootstrap : Using [0]team0:10.81.13.51<0>
srs-ds8:192240:192240 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

srs-ds8:192240:192240 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
srs-ds8:192240:192240 [0] NCCL INFO NET/Socket : Using [0]team0:10.81.13.51<0>
srs-ds8:192240:192316 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555

srs-ds8:192240:192316 [0] include/socket.h:390 NCCL WARN Connect to 10.81.13.54<34419> failed : No route to host
srs-ds8:192240:192316 [0] NCCL INFO bootstrap.cc:100 -> 2
srs-ds8:192240:192316 [0] NCCL INFO bootstrap.cc:326 -> 2
srs-ds8:192240:192316 [0] NCCL INFO init.cc:695 -> 2
srs-ds8:192240:192316 [0] NCCL INFO init.cc:951 -> 2
srs-ds8:192240:192316 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "distributed_example.py", line 17, in <module>
    dist.all_reduce(test_tensor, op=dist.ReduceOp.SUM)
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629411241/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8

For the Gloo case, I noticed that the master addresses for the two nodes are different. Is this a typo?

node 1: MASTER_ADDR=10.81.13.54
node 2: MASTER_ADDR=10.81.13.51

No, it’s not a typo. The addresses are different because that is the standalone test (one node, one worker) used as a sanity check. There are three cases: a single-node test, a multi-node test with the Gloo backend, and a multi-node test with the NCCL backend.
