NCCL failing with A100 GPUs, works fine with V100 GPUs

Hello everyone! I tried solving this issue on my own, but after a few days of trying I have to concede. Admittedly, I am no expert when it comes to Linux in general, and this is my first time working in a high-performance computing environment.

Although I was able to utilise DDP with NCCL in the past in order to train my models, I noticed a few days ago that I would get weird errors (something along the lines of size could not be broadcast) which I did not get when training my models a month ago. I did change PyTorch and Python versions since then, which is why I wanted to eliminate as many variables as possible and decided to work on a toy example taken from the DDP guide.

The following is the code I use:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    print("Initialising DDP")
    ddp_model = DDP(model, device_ids=[device_id])
    print("Initialised DDP")

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    print(f"Running PyTorch {torch.__version__} with CUDA {torch.version.cuda} and NCCL {torch.cuda.nccl.version()}")
    demo_basic()

which I run using this command:

torchrun --nnodes 2 --nproc_per_node 1 --rdzv-backend c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT path/to/script.py

The script works perfectly fine when I run it on two nodes with one V100 GPU each using NCCL or when using the Gloo backend with A100 GPUs, but I just cannot get it to work using NCCL and two nodes with one (or more, this is just for debugging purposes) A100 each. This is the error log I am receiving:

Running PyTorch 2.2.2+cu121 with CUDA 12.1 and NCCL (2, 19, 3)
Start running basic DDP example on rank 1.
Initialising DDP
hpc-g4-1:52582:52582 [0] NCCL INFO cudaDriverVersion 12020
hpc-g4-1:52582:52582 [0] NCCL INFO Bootstrap : Using eth0:123.45.67.89<0>
hpc-g4-1:52582:52582 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hpc-g4-1:52582:52592 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth0:123.45.67.89<0>
hpc-g4-1:52582:52592 [0] NCCL INFO Using non-device net plugin version 0
hpc-g4-1:52582:52592 [0] NCCL INFO Using network IB
hpc-g4-1:52582:52592 [0] NCCL INFO comm 0x8aef670 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 60 commId 0xe200dbb906c36c4f - Init START
hpc-g4-1:52582:52592 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
hpc-g4-1:52582:52592 [0] NCCL INFO P2P Chunksize set to 131072
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0

hpc-g4-1:52582:52594 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:520 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:647 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net.cc:677 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO transport/net.cc:304 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO transport.cc:148 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO init.cc:1117 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO init.cc:1396 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
hpc-g4-1:52582:52582 [0] NCCL INFO group.cc:418 -> 2
hpc-g4-1:52582:52582 [0] NCCL INFO group.cc:95 -> 2

hpc-g4-1:52582:52594 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:520 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:647 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net.cc:677 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:47 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:58 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:773 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO proxy.cc:1374 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO proxy.cc:1415 -> 3

hpc-g4-1:52582:52594 [0] proxy.cc:1557 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
Traceback (most recent call last):
  File "/users/felix.schoen/data/projects/Project/./meta/playground/ddp/ddp.py", line 45, in <module>
    demo_basic()
  File "/users/felix.schoen/data/projects/Project/./meta/playground/ddp/ddp.py", line 29, in demo_basic
    ddp_model = DDP(model, device_ids=[device_id])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52582 [0] NCCL INFO comm 0x8aef670 rank 1 nranks 2 cudaDev 0 busId 60 - Abort COMPLETE
[2024-04-23 00:11:00,951] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 52582) of binary: /users/felix.schoen/data/projects/Project/venv/bin/python3.11
Traceback (most recent call last):
  File "/users/felix.schoen/data/projects/Project/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./meta/playground/ddp/ddp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-23_00:11:00
  host      : hpc-g4-1.domain.com
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 52582)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

ulimit -l is already set to unlimited. According to the official documentation NCCL 2.19.3 (the version PyTorch ships with) doesn’t even support CUDA 12.1, although I find that hard to believe. According to nvidia-smi the current driver version installed is 535.129.03 with support for CUDA 12.2.
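Since `ibv_create_cq` failing with "Cannot allocate memory" is often related to the locked-memory limit, it may be worth confirming the limit from inside the process that torchrun starts. The value reported by `ulimit -l` in an interactive shell is not necessarily the one a launched job inherits (that this differs on this cluster is only an assumption). A minimal sketch using Python's standard `resource` module; `describe_limit` and `memlock_report` are hypothetical helper names:

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value the way `ulimit -l` would (KiB or 'unlimited')."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports RLIMIT_MEMLOCK in bytes, ulimit -l in KiB
    return f"{value // 1024} KiB"


def memlock_report() -> str:
    """Return the soft and hard locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return f"RLIMIT_MEMLOCK soft={describe_limit(soft)} hard={describe_limit(hard)}"


if __name__ == "__main__":
    print(memlock_report())
```

Calling `print(memlock_report())` at the top of `demo_basic()` on every rank would show whether the A100 nodes really run the job with an unlimited limit.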

I’d be grateful for pointers on how to solve this, thanks in advance!

It’s indeed not the case that NCCL 2.19.3 lacks CUDA 12.1 support: NCCL’s release notes just mention their binary builds and the corresponding CUDA runtime versions. We have been building PyTorch with NCCL==2.19.3+CUDA12.1 for a long time already and haven’t seen any issues.

The "Call to ibv_create_cq failed with error Cannot allocate memory" warning is indeed the issue, and you could try running the standalone NCCL tests to see whether it is reproducible there or is related to PyTorch. In the former case, you might want to create an issue in the NCCL repository (and CC me there so that I can also track it).
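Another way to narrow this down, offered as an untested suggestion for this cluster, is to rerun the toy script with NCCL's InfiniBand/RoCE transport disabled so that it falls back to TCP sockets. A minimal sketch: `nccl_debug_env` is a hypothetical helper, while `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `NCCL_IB_DISABLE` are documented NCCL environment variables.

```python
import os


def nccl_debug_env(disable_ib: bool = False) -> dict:
    """Build NCCL environment settings for a diagnostic run."""
    env = {
        "NCCL_DEBUG": "INFO",             # verbose NCCL logging
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # limit the log to init and network
    }
    if disable_ib:
        # Skip the IB/RoCE transport so NCCL uses plain sockets instead
        env["NCCL_IB_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    # Apply before dist.init_process_group("nccl") so NCCL picks the values up
    os.environ.update(nccl_debug_env(disable_ib=True))
    print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```

If the run succeeds with IB disabled, that would point at the verbs/RoCE setup on the A100 nodes rather than at PyTorch.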

Thanks for the quick reply! I ran the NCCL tests on the HPC system with one caveat: It currently only has NCCL modules up to version 2.18.3 with CUDA 12.1.1, which is what I used to run the tests.

I ran them on all four types of GPUs: P100s, V100s, Quadro RTX 6000s and A100s. Interestingly, the tests all seemed to pass. Here, for example, is the output for the setup with two nodes with one V100 GPU each:

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  36665 on  clip-g2-3 device  0 [0x00] Tesla V100-PCIE-32GB
#  Rank  0 Group  0 Pid  43906 on  clip-g2-2 device  0 [0x00] Tesla V100-PCIE-32GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     3.49    0.00    0.00      0     0.12    0.07    0.00      0
          16             4     float     sum      -1     4.73    0.00    0.00      0     0.12    0.13    0.00      0
          32             8     float     sum      -1     4.52    0.01    0.00      0     0.12    0.27    0.00      0
          64            16     float     sum      -1     4.68    0.01    0.00      0     0.12    0.54    0.00      0
         128            32     float     sum      -1     5.26    0.02    0.00      0     0.12    1.05    0.00      0
         256            64     float     sum      -1     4.43    0.06    0.00      0     0.12    2.14    0.00      0
         512           128     float     sum      -1     4.50    0.11    0.00      0     0.12    4.30    0.00      0
        1024           256     float     sum      -1     4.40    0.23    0.00      0     0.12    8.61    0.00      0
        2048           512     float     sum      -1     4.23    0.48    0.00      0     0.12   17.20    0.00      0
        4096          1024     float     sum      -1     4.33    0.94    0.00      0     0.12   34.57    0.00      0
        8192          2048     float     sum      -1     4.56    1.80    0.00      0     0.12   68.55    0.00      0
       16384          4096     float     sum      -1     4.51    3.63    0.00      0     0.12  136.59    0.00      0
       32768          8192     float     sum      -1     4.98    6.58    0.00      0     0.12  272.95    0.00      0
       65536         16384     float     sum      -1     5.02   13.06    0.00      0     0.12  557.28    0.00      0
      131072         32768     float     sum      -1     4.60   28.48    0.00      0     0.12  1093.18    0.00      0
      262144         65536     float     sum      -1     4.45   58.88    0.00      0     0.12  2253.06    0.00      0
      524288        131072     float     sum      -1     4.87  107.71    0.00      0     0.12  4490.69    0.00      0
     1048576        262144     float     sum      -1     6.02  174.28    0.00      0     0.12  8566.80    0.00      0
     2097152        524288     float     sum      -1     8.46  247.96    0.00      0     0.12  17331.83    0.00      0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     4194304       1048576     float     sum      -1    14.71  285.20    0.00      0     0.12  33866.00    0.00      0
           8             2     float     sum      -1     3.95    0.00    0.00      0     0.13    0.06    0.00      0
          16             4     float     sum      -1     5.45    0.00    0.00      0     0.13    0.13    0.00      0
          32             8     float     sum      -1     4.64    0.01    0.00      0     0.13    0.25    0.00      0
          64            16     float     sum      -1     4.35    0.01    0.00      0     0.13    0.51    0.00      0
         128            32     float     sum      -1     4.26    0.03    0.00      0     0.12    1.03    0.00      0
     8388608       2097152     float     sum      -1    25.69  326.50    0.00      0     0.12  67189.49    0.00      0
         256            64     float     sum      -1     4.26    0.06    0.00      0     0.12    2.06    0.00      0
         512           128     float     sum      -1     4.28    0.12    0.00      0     0.13    4.04    0.00      0
        1024           256     float     sum      -1     4.12    0.25    0.00      0     0.13    7.97    0.00      0
        2048           512     float     sum      -1     4.06    0.50    0.00      0     0.13   16.31    0.00      0
        4096          1024     float     sum      -1     4.27    0.96    0.00      0     0.13   32.44    0.00      0
        8192          2048     float     sum      -1     3.97    2.06    0.00      0     0.12   65.56    0.00      0
       16384          4096     float     sum      -1     4.18    3.92    0.00      0     0.12  131.44    0.00      0
       32768          8192     float     sum      -1     4.51    7.27    0.00      0     0.13  258.73    0.00      0
       65536         16384     float     sum      -1     4.35   15.06    0.00      0     0.13  517.46    0.00      0
    16777216       4194304     float     sum      -1    47.19  355.52    0.00      0     0.13  126334.46    0.00      0
      131072         32768     float     sum      -1     4.20   31.21    0.00      0     0.38  344.25    0.00      0
    33554432       8388608     float     sum      -1    90.14  372.26    0.00      0     0.13  261123.98    0.00      0
      262144         65536     float     sum      -1     4.24   61.88    0.00      0     0.13  2090.46    0.00      0
      524288        131072     float     sum      -1     4.55  115.17    0.00      0     0.12  4246.97    0.00      0
    67108864      16777216     float     sum      -1    176.1  381.12    0.00      0     0.13  530924.56    0.00      0
     1048576        262144     float     sum      -1     6.45  162.54    0.00      0     0.12  8425.68    0.00      0
   134217728      33554432     float     sum      -1    348.2  385.44    0.00      0     0.12  1080223.16    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

     2097152        524288     float     sum      -1     9.41  222.83    0.00      0     0.13  16670.52    0.00      0
     4194304       1048576     float     sum      -1    14.64  286.41    0.00      0     0.13  33235.37    0.00      0
     8388608       2097152     float     sum      -1    26.00  322.62    0.00      0     0.13  65204.88    0.00      0
    16777216       4194304     float     sum      -1    47.17  355.68    0.00      0     0.13  133896.38    0.00      0
    33554432       8388608     float     sum      -1    90.09  372.46    0.00      0     0.13  265777.68    0.00      0
    67108864      16777216     float     sum      -1    176.0  381.28    0.00      0     0.13  531134.66    0.00      0
   134217728      33554432     float     sum      -1    348.0  385.63    0.00      0     0.13  1061849.11    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

In contrast, this is the output of the first run on two nodes with one A100 each:

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   9692 on clip-g4-11 device  0 [0xb1] NVIDIA A100-SXM4-40GB
#  Rank  0 Group  0 Pid  10235 on  clip-g4-9 device  0 [0x17] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     6.62    0.00    0.00      0     0.27    0.03    0.00      0
          16             4     float     sum      -1     6.46    0.00    0.00      0     0.27    0.06    0.00      0
          32             8     float     sum      -1     6.46    0.00    0.00      0     0.28    0.11    0.00      0
          64            16     float     sum      -1     6.39    0.01    0.00      0     0.27    0.24    0.00      0
         128            32     float     sum      -1     6.40    0.02    0.00      0     0.27    0.48    0.00      0
         256            64     float     sum      -1     6.41    0.04    0.00      0     0.27    0.96    0.00      0
         512           128     float     sum      -1     6.41    0.08    0.00      0     0.27    1.92    0.00      0
           8             2     float     sum      -1     6.06    0.00    0.00      0     0.24    0.03    0.00      0
        1024           256     float     sum      -1     6.43    0.16    0.00      0     0.20    5.24    0.00      0
          16             4     float     sum      -1     5.95    0.00    0.00      0     0.25    0.06    0.00      0
        2048           512     float     sum      -1     4.88    0.42    0.00      0     0.20   10.36    0.00      0
          32             8     float     sum      -1     5.92    0.01    0.00      0     0.25    0.13    0.00      0
        4096          1024     float     sum      -1     5.02    0.82    0.00      0     0.19   21.21    0.00      0
          64            16     float     sum      -1     5.09    0.01    0.00      0     0.18    0.36    0.00      0
        8192          2048     float     sum      -1     5.18    1.58    0.00      0     0.20   41.82    0.00      0
         128            32     float     sum      -1     4.44    0.03    0.00      0     0.18    0.73    0.00      0
         256            64     float     sum      -1     4.57    0.06    0.00      0     0.17    1.47    0.00      0
       16384          4096     float     sum      -1     4.91    3.34    0.00      0     0.22   73.95    0.00      0
         512           128     float     sum      -1     4.35    0.12    0.00      0     0.17    3.01    0.00      0
        1024           256     float     sum      -1     4.44    0.23    0.00      0     0.17    5.98    0.00      0
        2048           512     float     sum      -1     4.42    0.46    0.00      0     0.17   11.91    0.00      0
       32768          8192     float     sum      -1     5.01    6.54    0.00      0     0.22  148.41    0.00      0
        4096          1024     float     sum      -1     4.44    0.92    0.00      0     0.17   24.04    0.00      0
        8192          2048     float     sum      -1     4.43    1.85    0.00      0     0.18   46.37    0.00      0
       16384          4096     float     sum      -1     4.41    3.72    0.00      0     0.20   82.04    0.00      0
       32768          8192     float     sum      -1     4.47    7.33    0.00      0     0.13  247.59    0.00      0
       65536         16384     float     sum      -1     5.02   13.06    0.00      0     0.15  449.03    0.00      0
       65536         16384     float     sum      -1     4.05   16.17    0.00      0     0.14  482.41    0.00      0
      131072         32768     float     sum      -1     4.23   30.99    0.00      0     0.11  1147.74    0.00      0
      131072         32768     float     sum      -1     4.13   31.70    0.00      0     0.12  1094.09    0.00      0
      262144         65536     float     sum      -1     4.32   60.70    0.00      0     0.12  2205.67    0.00      0
      262144         65536     float     sum      -1     4.62   56.79    0.00      0     0.12  2270.63    0.00      0
      524288        131072     float     sum      -1     4.57  114.66    0.00      0     0.12  4484.93    0.00      0
      524288        131072     float     sum      -1     4.63  113.23    0.00      0     0.12  4508.07    0.00      0
     1048576        262144     float     sum      -1     6.29  166.79    0.00      0     0.12  9031.66    0.00      0
     1048576        262144     float     sum      -1     8.48  123.64    0.00      0     0.13  7983.07    0.00      0
     2097152        524288     float     sum      -1     8.09  259.22    0.00      0     0.12  17571.45    0.00      0
     2097152        524288     float     sum      -1     8.27  253.71    0.00      0     0.12  18055.55    0.00      0
     4194304       1048576     float     sum      -1    11.92  352.00    0.00      0     0.12  34620.75    0.00      0
     4194304       1048576     float     sum      -1    11.87  353.31    0.00      0     0.12  36157.79    0.00      0
     8388608       2097152     float     sum      -1    16.63  504.48    0.00      0     0.12  70879.66    0.00      0
     8388608       2097152     float     sum      -1    18.02  465.54    0.00      0     0.11  74698.20    0.00      0
    16777216       4194304     float     sum      -1    29.07  577.15    0.00      0     0.11  146079.37    0.00      0
    16777216       4194304     float     sum      -1    30.31  553.53    0.00      0     0.11  148274.11    0.00      0
    33554432       8388608     float     sum      -1    54.78  612.53    0.00      0     0.12  289262.34    0.00      0
    33554432       8388608     float     sum      -1    54.70  613.38    0.00      0     0.11  297864.47    0.00      0
    67108864      16777216     float     sum      -1    105.6  635.68    0.00      0     0.12  567037.30    0.00      0
    67108864      16777216     float     sum      -1    105.3  637.47    0.00      0     0.11  596788.47    0.00      0
   134217728      33554432     float     sum      -1    206.4  650.41    0.00      0     0.11  1167110.68    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

   134217728      33554432     float     sum      -1    206.6  649.63    0.00      0     0.12  1148632.67    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

Here, the formatting confused me because the first “out-of-place” block is empty. When I ran the test again, though, this was no longer the case:

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  42962 on  clip-g4-3 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  0 Group  0 Pid  17276 on  clip-g4-4 device  0 [0xca] NVIDIA A100-SXM4-40GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     5.61    0.00    0.00      0     0.17    0.05    0.00      0
          16             4     float     sum      -1     4.51    0.00    0.00      0     0.18    0.09    0.00      0
          32             8     float     sum      -1     4.50    0.01    0.00      0     0.17    0.18    0.00      0
          64            16     float     sum      -1     4.48    0.01    0.00      0     0.17    0.38    0.00      0
         128            32     float     sum      -1     4.50    0.03    0.00      0     0.17    0.75    0.00      0
         256            64     float     sum      -1     4.64    0.06    0.00      0     0.17    1.49    0.00      0
         512           128     float     sum      -1     4.45    0.12    0.00      0     0.17    3.02    0.00      0
        1024           256     float     sum      -1     4.65    0.22    0.00      0     0.17    6.08    0.00      0
        2048           512     float     sum      -1     4.44    0.46    0.00      0     0.17   12.08    0.00      0
        4096          1024     float     sum      -1     4.47    0.92    0.00      0     0.11   35.76    0.00      0
        8192          2048     float     sum      -1     3.76    2.18    0.00      0     0.12   70.08    0.00      0
       16384          4096     float     sum      -1     3.88    4.22    0.00      0     0.13  121.41    0.00      0
       32768          8192     float     sum      -1     3.95    8.29    0.00      0     0.13  247.87    0.00      0
       65536         16384     float     sum      -1     3.92   16.72    0.00      0     0.12  555.63    0.00      0
      131072         32768     float     sum      -1     4.01   32.66    0.00      0     0.12  1128.47    0.00      0
      262144         65536     float     sum      -1     4.25   61.75    0.00      0     0.11  2302.54    0.00      0
      524288        131072     float     sum      -1     4.49  116.84    0.00      0     0.12  4502.26    0.00      0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576        262144     float     sum      -1     5.78  181.49    0.00      0     0.11  9185.95    0.00      0
           8             2     float     sum      -1     7.97    0.00    0.00      0     0.34    0.02    0.00      0
          16             4     float     sum      -1     9.78    0.00    0.00      0     0.33    0.05    0.00      0
          32             8     float     sum      -1     8.95    0.00    0.00      0     0.32    0.10    0.00      0
          64            16     float     sum      -1     9.39    0.01    0.00      0     0.32    0.20    0.00      0
     2097152        524288     float     sum      -1     7.63  274.92    0.00      0     0.11  18509.73    0.00      0
         128            32     float     sum      -1     8.77    0.01    0.00      0     0.33    0.39    0.00      0
         256            64     float     sum      -1     8.76    0.03    0.00      0     0.33    0.78    0.00      0
         512           128     float     sum      -1     8.39    0.06    0.00      0     0.33    1.56    0.00      0
        1024           256     float     sum      -1     7.11    0.14    0.00      0     0.21    4.95    0.00      0
        2048           512     float     sum      -1     6.10    0.34    0.00      0     0.21    9.73    0.00      0
        4096          1024     float     sum      -1     6.08    0.67    0.00      0     0.21   19.60    0.00      0
        8192          2048     float     sum      -1     6.03    1.36    0.00      0     0.21   39.55    0.00      0
       16384          4096     float     sum      -1     6.22    2.63    0.00      0     0.24   68.80    0.00      0
       32768          8192     float     sum      -1     6.22    5.27    0.00      0     0.25  133.28    0.00      0
       65536         16384     float     sum      -1     6.10   10.75    0.00      0     0.17  376.00    0.00      0
     4194304       1048576     float     sum      -1    11.47  365.79    0.00      0     0.11  36808.28    0.00      0
      131072         32768     float     sum      -1     4.90   26.75    0.00      0     0.12  1121.71    0.00      0
     8388608       2097152     float     sum      -1    17.45  480.85    0.00      0     0.11  73941.01    0.00      0
      262144         65536     float     sum      -1     4.05   64.74    0.00      0     0.11  2285.48    0.00      0
    16777216       4194304     float     sum      -1    29.65  565.77    0.00      0     0.11  149130.81    0.00      0
      524288        131072     float     sum      -1     4.40  119.21    0.00      0     0.12  4462.03    0.00      0
    33554432       8388608     float     sum      -1    55.45  605.08    0.00      0     0.11  293436.22    0.00      0
     1048576        262144     float     sum      -1     6.02  174.06    0.00      0     0.12  8996.79    0.00      0
    67108864      16777216     float     sum      -1    106.3  631.37    0.00      0     0.11  589449.84    0.00      0
     2097152        524288     float     sum      -1     7.82  268.08    0.00      0     0.12  17848.10    0.00      0
     4194304       1048576     float     sum      -1    11.73  357.69    0.00      0     0.11  36647.48    0.00      0
   134217728      33554432     float     sum      -1    207.1  648.05    0.00      0     0.11  1183056.22    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

     8388608       2097152     float     sum      -1    16.62  504.81    0.00      0     0.11  74400.07    0.00      0
    16777216       4194304     float     sum      -1    28.92  580.14    0.00      0     0.11  146206.68    0.00      0
    33554432       8388608     float     sum      -1    54.77  612.61    0.00      0     0.12  284600.78    0.00      0
    67108864      16777216     float     sum      -1    105.3  637.11    0.00      0     0.11  592572.75    0.00      0
   134217728      33554432     float     sum      -1    206.2  650.83    0.00      0     0.11  1167618.34    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

To me it seems like NCCL is working just fine but that there could be a configuration problem with the HPC system. Do you have any further pointers on how to fix these issues? I plan on contacting the administrators of the system in order to get this resolved and I’d like to be able to provide them with as much information as possible!
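To hand the administrators a compact summary rather than the raw tables, the rows can be reduced to a few numbers. A minimal sketch, assuming the 13-column layout shown in the table headers above (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and for in-place); `Row`, `parse_row` and `summarise` are hypothetical names:

```python
from typing import Iterable, NamedTuple, Optional


class Row(NamedTuple):
    size_bytes: int
    oop_time_us: float   # out-of-place time
    oop_busbw: float     # out-of-place bus bandwidth (GB/s)
    ip_time_us: float    # in-place time
    ip_busbw: float      # in-place bus bandwidth (GB/s)
    wrong: int           # #wrong summed over both variants


def parse_row(line: str) -> Optional[Row]:
    """Parse one data row; return None for comments, blanks and odd lines."""
    fields = line.split()
    if len(fields) != 13 or line.lstrip().startswith("#"):
        return None
    try:
        return Row(
            size_bytes=int(fields[0]),
            oop_time_us=float(fields[5]),
            oop_busbw=float(fields[7]),
            ip_time_us=float(fields[9]),
            ip_busbw=float(fields[11]),
            wrong=int(fields[8]) + int(fields[12]),
        )
    except ValueError:
        # e.g. a non-numeric #wrong column when validation is switched off
        return None


def summarise(lines: Iterable[str]) -> dict:
    """Reduce a test log to row count, best bus bandwidth and error count."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    return {
        "rows": len(rows),
        "max_busbw": max((max(r.oop_busbw, r.ip_busbw) for r in rows), default=0.0),
        "total_wrong": sum(r.wrong for r in rows),
    }
```

Feeding a saved log through `summarise(open(path))` gives one line per run to paste into a ticket. For the logs above it would report a maximum busbw of 0.0, matching their `Avg bus bandwidth : 0` lines.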

On an unrelated note: I really appreciate all your thousands of answers, they were really helpful in the past!