I’m not sure how to fix this; it only happens with NCCL 2.17.1.
Please note that I am using an NVIDIA PyTorch Docker image that ships with PyTorch and NCCL preinstalled. I assume the behavior of this line has changed in PyTorch 2.x?
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
However, I checked the PyTorch 2 documentation and it uses the same syntax for DDP, so could you please help me figure out how to fix this? I am using this PyTorch image: nvcr.io/nvidia/pytorch:23.04-py3
I am running the job on an Azure cluster with 2 nodes, each with 2 V100 GPUs.
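For context, the relevant part of train.py looks roughly like this (a simplified sketch: the data loading and training loop are omitted, and apart from the DDP line quoted in the traceback the exact variable names are approximations):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision

# Rank/size info comes from the launcher environment (see the log below).
local_rank = int(os.environ["LOCAL_RANK"])
world_rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"World size: {world_size}")
print(f"local rank is {local_rank} and world rank is {world_rank}")

# MASTER_ADDR / MASTER_PORT are also provided by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50(pretrained=True).cuda(local_rank)

# train.py line 163, where the AttributeError is raised:
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)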
Traceback (most recent call last):
File "train.py", line 163, in <module>
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 631, in __init__
current_cga = default_pg_nccl.options.config.cga_cluster_size
AttributeError: 'torch._C._distributed_c10d._ProcessGroupWrapper' object has no attribute 'options'
Thu Jun 1 15:57:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000001:00:00.0 Off | Off |
| N/A 28C P0 36W / 250W | 306MiB / 16160MiB | 14% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000002:00:00.0 Off | Off |
| N/A 27C P0 24W / 250W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
CPython
3.8.10
uname_result(system='Linux', node='6a1c317608e24783b72ea7865b6b88fd000001', release='5.15.0-1038-azure', version='#45~20.04.1-Ubuntu SMP Tue Apr 25 18:45:15 UTC 2023', machine='x86_64', processor='x86_64')
NCCL version is: (2, 17, 1)
System information: Linux #45~20.04.1-Ubuntu SMP Tue Apr 25 18:45:15 UTC 2023
Python version: 3.8.10
MLflow version: 2.3.2
MLflow module location: /usr/local/lib/python3.8/dist-packages/mlflow/__init__.py
Tracking URI: URI
Registry URI: URI
MLflow environment variables:
MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING: True
MLFLOW_EXPERIMENT_ID: 97cdf0ad-6496-41c6-92a3-609b2474fa29
MLFLOW_EXPERIMENT_NAME: dev_CIFAR10_DDP_train_test2
MLFLOW_RUN_ID: e2fde4d3-d883-4134-8e7d-57223afad43d
MLFLOW_TRACKING_TOKEN: token
MLFLOW_TRACKING_URI: URI
MLflow dependencies:
Flask: 2.3.2
Jinja2: 3.1.2
alembic: 1.11.1
click: 8.1.3
cloudpickle: 2.2.1
databricks-cli: 0.17.7
docker: 6.1.3
entrypoints: 0.4
gitpython: 3.1.31
gunicorn: 20.1.0
importlib-metadata: 6.3.0
markdown: 3.4.3
matplotlib: 3.7.1
numpy: 1.22.2
packaging: 23.0
pandas: 1.5.2
protobuf: 3.20.3
pyarrow: 10.0.1.dev0+ga6eabc2b.d20230410
pytz: 2023.3
pyyaml: 6.0
querystring-parser: 1.2.4
requests: 2.28.2
scikit-learn: 1.2.0
scipy: 1.10.1
sqlalchemy: 2.0.15
sqlparse: 0.4.4
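The diagnostics above were collected with roughly the following snippet inside the container (a simplified sketch; the MLflow block is what mlflow.doctor() prints):

import platform
import torch
import mlflow

print(platform.python_implementation())                   # CPython
print(platform.python_version())                          # 3.8.10
print(platform.uname())                                    # uname_result(system='Linux', ...)
print(f"NCCL version is: {torch.cuda.nccl.version()}")    # (2, 17, 1) in this image
mlflow.doctor()                                            # system info, MLflow version, URIs, env vars, dependencies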
INFO:__main__:os.getpid() is 23 and initializing process group with {'MASTER_ADDR': '10.0.0.5', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '4'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.17.1+cuda12.1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO P2P plugin IBext
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/IB : No device found.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Using network Socket
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531444234/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531444234/pci0001:00/0001:00:00.0/../max_link_width, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531444234/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531444234/pci0002:00/0002:00:00.0/../max_link_width, ignoring
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bdda-0489-6045-bdda-04896045bdda is not a PCI device (vmbus). Attaching to first CPU
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO === System : maxBw 5.0 totalBw 12.0 ===
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO CPU/0 (1/1/1)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[5000.0] - NIC/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + NET[5.0] - NET/0 (0/0/5.000000)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO ==========================================
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC)
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 5.000000/5.000000, type PHB/PHB, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/1 NET/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 6.000000/5.000000, type PHB/PHB, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO 0 : NET/0 GPU/0 GPU/1 NET/0
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
World size: 4
local rank is 0 and world rank is 0
PyTorch version is 2.1.0a0+fe05266 and torchvision version is 0.15.0a0
100%|██████████| 97.8M/97.8M [00:00<00:00, 300MB/s]
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/2/-1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Tree 1 : 2 -> 0 -> 1/-1/-1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/02 : 0 1 2 3
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/02 : 0 1 2 3
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO P2P Chunksize set to 131072
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 3[200000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 3[200000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via SHM/direct/direct
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via SHM/direct/direct
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Connected all rings
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 2[100000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 2[100000] -> 0[100000] [receive] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 00/0 : 0[100000] -> 2[100000] [send] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Channel 01/0 : 0[100000] -> 2[100000] [send] via NET/Socket/0
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO Connected all trees
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO NCCL_P2P_PXN_LEVEL set by environment to 0.
6a1c317608e24783b72ea7865b6b88fd000001:23:124 [0] NCCL INFO comm 0x9e0a800 rank 0 nranks 4 cudaDev 0 busId 100000 commId 0x698090346b34d31a - Init COMPLETE
Traceback (most recent call last):
File "train.py", line 163, in <module>
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 631, in __init__
current_cga = default_pg_nccl.options.config.cga_cluster_size
AttributeError: 'torch._C._distributed_c10d._ProcessGroupWrapper' object has no attribute 'options'
6a1c317608e24783b72ea7865b6b88fd000001:23:128 [0] NCCL INFO [Service thread] Connection closed by localRank 0
6a1c317608e24783b72ea7865b6b88fd000001:23:23 [0] NCCL INFO comm 0x9e0a800 rank 0 nranks 4 cudaDev 0 busId 100000 - Abort COMPLETE
Here’s the Dockerfile:
FROM nvcr.io/nvidia/pytorch:23.04-py3
##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0
# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
cd /tmp/nccltests && \
git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
cd nccl-tests && \
make \
MPI=1 MPI_HOME=/opt/hpcx/ompi \
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
CUDA_HOME=/usr/local/cuda && \
cp ./build/* /usr/local/bin && \
rm -rf /tmp/nccltests
# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt
# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft
# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"
# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"
# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
# ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
# setting NCCL_IB_PCI_RELAXED_ORDERING to 0 is suggested for NCCL 2.18.1
ENV NCCL_IB_PCI_RELAXED_ORDERING="0"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_P2P_PXN_LEVEL="0"
# ENV NCCL_P2P_DISABLE="1"
# ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"
Here’s the environment variable JSON:
{
"NCCL_DEBUG": "INFO",
"NCCL_IB_PCI_RELAXED_ORDERING": "0",
"NCCL_IB_DISABLE": "1",
"NCCL_SOCKET_IFNAME": "eth0",
"NCCL_P2P_PXN_LEVEL": "0",
"CUDA_DEVICE_ORDER": "PCI_BUS_ID",
"TORCH_DISTRIBUTED_DEBUG": "DETAIL"
}
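One detail that might be relevant: the type named in the AttributeError, _ProcessGroupWrapper, is, as far as I understand, the debug wrapper that PyTorch puts around process groups when TORCH_DISTRIBUTED_DEBUG=DETAIL is set (the last entry in the JSON above). In case it helps, here is a minimal standalone sketch (untested) that exercises the same DistributedDataParallel constructor path; I could launch it with and without that variable to compare:

import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Launch with the same env:// variables as the real job
# (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE).
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(local_rank)

# A tiny model is enough to reach the DDP constructor that fails in train.py.
model = nn.Linear(8, 8).cuda(local_rank)
ddp = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

print(f"rank {dist.get_rank()}: DDP constructed OK, "
      f"TORCH_DISTRIBUTED_DEBUG={os.environ.get('TORCH_DISTRIBUTED_DEBUG')}")

dist.destroy_process_group()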