Problem with PyTorch 2.x.x and NCCL 2.17.1 -- AttributeError: 'torch._C._distributed_c10d._ProcessGroupWrapper' object has no attribute 'options'

I'm not sure how to fix this. It only happens with NCCL 2.17.1.

Please note that I am using an NVIDIA PyTorch Docker image that ships with PyTorch and NCCL preinstalled. Has this line changed in PyTorch 2.x.x?
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

However, I checked the torch 2 documentation and it uses the same syntax for DDP, so could you please help me figure out how to fix this? I am using this NVIDIA PyTorch image: nvcr.io/nvidia/pytorch:23.04-py3

I am running the job on an Azure cluster with 2 nodes, each with 2 V100 GPUs.
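For reference, the DDP setup around line 163 of train.py follows the standard pattern below. This is a minimal sketch: only the DistributedDataParallel call is verbatim from my script (and the traceback below); the surrounding setup is paraphrased.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision

# Rank/world size come from the launcher env
# (see MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the log below).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torchvision.models.resnet50(pretrained=True).cuda(local_rank)
# train.py line 163 -- this is the call that raises the AttributeError:
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)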

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 631, in __init__
    current_cga = default_pg_nccl.options.config.cga_cluster_size
AttributeError: 'torch._C._distributed_c10d._ProcessGroupWrapper' object has no attribute 'options'
Thu Jun  1 15:57:25 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0    36W / 250W |    306MiB / 16160MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000002:00:00.0 Off |                  Off |
| N/A   27C    P0    24W / 250W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
CPython
3.8.10
uname_result(system='Linux', node='6a1c317608e24783b72ea7865b6b88fd000001', release='5.15.0-1038-azure', version='#45~20.04.1-Ubuntu SMP Tue Apr 25 18:45:15 UTC 2023', machine='x86_64', processor='x86_64')
NCCL version is:  (2, 17, 1)
System information: Linux #45~20.04.1-Ubuntu SMP Tue Apr 25 18:45:15 UTC 2023
Python version: 3.8.10
MLflow version: 2.3.2
MLflow module location: /usr/local/lib/python3.8/dist-packages/mlflow/__init__.py
Tracking URI: URI
Registry URI: URI
MLflow environment variables: 
  MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING: True
  MLFLOW_EXPERIMENT_ID: 97cdf0ad-6496-41c6-92a3-609b2474fa29
  MLFLOW_EXPERIMENT_NAME: dev_CIFAR10_DDP_train_test2
  MLFLOW_RUN_ID: e2fde4d3-d883-4134-8e7d-57223afad43d
  MLFLOW_TRACKING_TOKEN: token
  MLFLOW_TRACKING_URI: URI
MLflow dependencies: 
  Flask: 2.3.2
  Jinja2: 3.1.2
  alembic: 1.11.1
  click: 8.1.3
  cloudpickle: 2.2.1
  databricks-cli: 0.17.7
  docker: 6.1.3
  entrypoints: 0.4
  gitpython: 3.1.31
  gunicorn: 20.1.0
  importlib-metadata: 6.3.0
  markdown: 3.4.3
  matplotlib: 3.7.1
  numpy: 1.22.2
  packaging: 23.0
  pandas: 1.5.2
  protobuf: 3.20.3
  pyarrow: 10.0.1.dev0+ga6eabc2b.d20230410
  pytz: 2023.3
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.28.2
  scikit-learn: 1.2.0
  scipy: 1.10.1
  sqlalchemy: 2.0.15
  sqlparse: 0.4.4
INFO:__main__:os.getpid() is 23 and initializing process group with {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '4'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.17.1+cuda12.1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO P2P plugin IBext
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/IB : No device found.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Using network Socket
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531444234/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531444234/pci0001:00/0001:00:00.0/../max_link_width, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531444234/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531444234/pci0002:00/0002:00:00.0/../max_link_width, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bdda-0489-6045-bdda-04896045bdda is not a PCI device (vmbus). Attaching to first CPU
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO === System : maxBw 5.0 totalBw 12.0 ===
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO CPU/0 (1/1/1)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[5000.0] - NIC/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO ==========================================
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 5.000000/5.000000, type PHB/PHB, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 NET/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 6.000000/5.000000, type PHB/PHB, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 NET/0
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
World size: 4
local rank is 0 and world rank is 0
PyTorch version is 2.1.0a0+fe05266 and torchvision version is 0.15.0a0

100%|██████████| 97.8M/97.8M [00:00<00:00, 300MB/s]
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/2/-1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Tree 1 : 2 -> 0 -> 1/-1/-1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/02 :    0   1   2   3
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/02 :    0   1   2   3
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO P2P Chunksize set to 131072
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 3[200000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 3[200000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via SHM/direct/direct
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via SHM/direct/direct
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Connected all rings
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 2[100000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 2[100000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 0[100000] -> 2[100000] [send] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 0[100000] -> 2[100000] [send] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Connected all trees
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_P2P_PXN_LEVEL set by environment to 0.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO comm 0x9e0a800 rank 0 nranks 4 cudaDev 0 busId 100000 commId 0x698090346b34d31a - Init COMPLETE
Traceback (most recent call last):
  File "train.py", line 163, in <module>
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 631, in __init__
    current_cga = default_pg_nccl.options.config.cga_cluster_size
AttributeError: 'torch._C._distributed_c10d._ProcessGroupWrapper' object has no attribute 'options'
6a1c317608e24783b72ea7865b6b88fd000001:23:128 [0] NCCL INFO [Service thread] Connection closed by localRank 0
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO comm 0x9e0a800 rank 0 nranks 4 cudaDev 0 busId 100000 - Abort COMPLETE
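For what it's worth, the type named in the traceback can be checked directly after init_process_group with a small diagnostic like the sketch below (_get_default_group is a private torch.distributed API, used here only for debugging):

import torch.distributed as dist

# Debugging sketch: inspect the default process group's type after init.
# _get_default_group() is a private API, used only for diagnosis.
pg = dist.distributed_c10d._get_default_group()
print(type(pg))                # per the traceback, this is _ProcessGroupWrapper
print(hasattr(pg, "options"))  # False reproduces the failing attribute lookup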

Here’s the Dockerfile:

FROM nvcr.io/nvidia/pytorch:23.04-py3


##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support M60, V100, A100 (K80/sm_37 is no longer supported by the CUDA 12 toolkit in this image)
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt


# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
COPY ./ndv4-topo.xml /opt/microsoft/

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the verbosity of NCCL logging
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed ordering can greatly help the performance of InfiniBand networks in virtualized environments.
# ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
# NOTE: setting NCCL_IB_PCI_RELAXED_ORDERING to 0 is suggested for NCCL 2.18.1.
ENV NCCL_IB_PCI_RELAXED_ORDERING="0"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_P2P_PXN_LEVEL="0"
# ENV NCCL_P2P_DISABLE="1"
# ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"
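To verify that the container's NCCL build and these env vars actually reach the job (the "NCCL version is: (2, 17, 1)" line in the log above is this kind of check), a small startup snippet can be used. A sketch, assuming torch.cuda.nccl.version() for the version tuple:

import os
import torch

# Print the NCCL version this PyTorch build links against, plus the
# env vars set in the Dockerfile / job JSON as the process sees them.
print("NCCL version is: ", torch.cuda.nccl.version())
for var in ("NCCL_DEBUG", "NCCL_IB_DISABLE", "NCCL_SOCKET_IFNAME",
            "NCCL_P2P_PXN_LEVEL", "NCCL_IB_PCI_RELAXED_ORDERING",
            "TORCH_DISTRIBUTED_DEBUG"):
    print(var, "=", os.environ.get(var))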

Here’s the environment variable JSON:

{
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_PCI_RELAXED_ORDERING": "0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_P2P_PXN_LEVEL": "0",
    "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL"
}
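One detail that may be relevant: TORCH_DISTRIBUTED_DEBUG=DETAIL makes PyTorch wrap every process group in _ProcessGroupWrapper for collective consistency checks, and that is exactly the type named in the traceback. As an experiment (not a confirmed fix), the debug wrapper can be disabled before process-group init:

import os

# Experiment, not a confirmed fix: with TORCH_DISTRIBUTED_DEBUG=DETAIL, PyTorch
# wraps process groups in _ProcessGroupWrapper; turning it off should leave the
# raw NCCL process group (which does have an `options` attribute) as the default.
# This must run before torch.distributed.init_process_group().
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "OFF"

Is that worth trying, or is there a proper fix for this PyTorch/NCCL combination?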