Detected mismatch between collectives on ranks

I am trying to train a network on a system with multiple GPUs, but I keep getting this error and I can't track down what the issue might be. The NCCL logs (below) don't appear to list the failing broadcast op for either rank.

RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[34112], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))

On both ranks the inputs have the same dimensions ([18, 3, 512, 512]) and the exact same model is being used.

The model is constructed like this:

    torch.distributed.init_process_group(backend=torch.distributed.Backend.NCCL, rank=rank, world_size=world_size)
    model = CenterNet(
        num_classes=config.dataset.num_classes,
        use_fpn=config.model.use_fpn,
        use_separable_conv=config.model.use_separable_conv,
        device=torch_device,
    ).to(torch_device)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[rank], output_device=rank, find_unused_parameters=False
    )

Sometimes the error appears on an all_reduce op that seems to be related to SyncBatchNorm (I will try to get a stack trace for this the next time it shows up).

Can anyone suggest how I can get more information about what is failing?

I am using a Docker image based on pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel and I have set the following environment variables:

    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = find_free_port()
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"
    os.environ["NCCL_DEBUG_FILE"] = "/output/nccl_logs.txt"

NCCL logs show

f85219af9a4b:13:13 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
f85219af9a4b:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f23a9ffe200 recvbuff 0x7f23a9ffe200 count 5712 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbufff85219af9a4b:12:12 
[0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb120ebc000 recvbuff 0x7fb120ebc000 count 9197968 datatype 0 op f85219af9a4b:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f23a9de9600 recvbuff 0x7f23a9de9600 count 416 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb121b77400 recvbuff 0x7fb121b77400 count 136448 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb119dfea00 recvbuff 0x7fb119dfea00 count 416 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f23a9ffec00 recvbuff 0x7f23a9ffee00 count 260 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f23a9fff000 recvbuff 0x7f23a9fff400 count 260 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f23a9fffe00 recvbuff 0x7f23b1df0800 count 132 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f23b1df1000 recvbuff 0x7f23b1df1400 count 772 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df2800 recvbuff 0x7fb121df2c00 count 772 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df3a00 recvbuff 0x7fb121df3c00 count 196 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df4600 recvbuff 0x7fb121df4c00 count 1156 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df5e00 recvbuff 0x7fb121df6400 count 1156 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df4200 recvbuff 0x7fb121df4400 count 196 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df7c00 recvbuff 0x7fb121df8200 count 1156 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df8c00 recvbuff 0x7fb121df9200 count 1156 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df7800 recvbuff 0x7fb121dfa400 count 260 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dfb200 recvbuff 0x7fb121dfba00 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dfd000 recvbuff 0x7fb121dfd800 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dfb000 recvbuff 0x7fb121dfee00 count 260 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f96000 recvbuff 0x7fb121f96800 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f97e00 recvbuff 0x7fb121f98600 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dff800 recvbuff 0x7fb121f99c00 count 260 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f9a000 recvbuff 0x7fb121f9a800 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f9ba00 recvbuff 0x7fb121f9c200 count 1540 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df2a00 recvbuff 0x7fb121df2e00 count 516 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bf8200 recvbuff 0x7fb121bf9000 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bfb600 recvbuff 0x7fb121bfc400 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bf7c00 recvbuff 0x7fb121bfea00 count 516 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f9d000 recvbuff 0x7fb121f9de00 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa0400 recvbuff 0x7fb121fa1200 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bff400 recvbuff 0x7fb121bff800 count 516 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa3a00 recvbuff 0x7fb121fa4800 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df7c00 recvbuff 0x7fb121fa6e00 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f23b1df1800 recvbuff 0x7f23b1df7000 count 516 datatype 0 op 0 root 0 comm 0x556062238000 [nranks=2] stream 0x55607b5f17d0
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121faa200 recvbuff 0x7fb121fab000 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121df8c00 recvbuff 0x7fb121fad600 count 3076 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dfbe00 recvbuff 0x7fb121fa9600 count 772 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb0c95c0000 recvbuff 0x7fb0c95c1400 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f96000 recvbuff 0x7fb0c95c4e00 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121dfee00 recvbuff 0x7fb121dfd000 count 772 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f99c00 recvbuff 0x7fb0c95c8800 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f9ba00 recvbuff 0x7fb0c95cb800 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bf7c00 recvbuff 0x7fb121f97e00 count 772 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa6e00 recvbuff 0x7fb121bfb600 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bf8c00 recvbuff 0x7fb121fa0400 count 4612 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121bfea00 recvbuff 0x7fb121df7c00 count 1284 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa3a00 recvbuff 0x7fb0c15d0000 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121f9ce00 recvbuff 0x7fb0c15d3e00 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb0c95c6e00 recvbuff 0x7fb0c95c4e00 count 1284 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb0c95c8800 recvbuff 0x7fb0c15d7c00 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa0400 recvbuff 0x7fb0c15dba00 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb0c95ca800 recvbuff 0x7fb121df7c00 count 1284 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa0400 recvbuff 0x7fb121f9ba00 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121fa3a00 recvbuff 0x7fb121f9ba00 count 7684 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb0c95c4e00 recvbuff 0x7fb121fa4a00 count 2564 datatype 0 op 
f85219af9a4b:13:13 [1] NCCL INFO AllGather: opCount 0 sendbufff85219af9a4b:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fb121ffd600 recvbuff 0x7fb0c14a0000 count 10244 datatype 0 op 0 root 0 comm 0x55f6e0728000 [nranks=2] stream 0x55f6df1e9900

Other output (stdout/stderr)

[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I socket.cpp:417] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:462] [c10d - debug] The server socket is attempting to listen on [::]:37861.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:37861.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 37861).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [localhost]:37861.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:37861 on [localhost]:47436.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:37861 has accepted a connection from [localhost]:47436.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 37861).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [localhost]:37861.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:37861 on [localhost]:47438.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:37861 has accepted a connection from [localhost]:47438.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 37861).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [localhost]:37861.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:37861 on [localhost]:47440.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:37861 has accepted a connection from [localhost]:47440.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 37861).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [localhost]:37861.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:37861 on [localhost]:47442.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:37861 has accepted a connection from [localhost]:47442.
[I ProcessGroupNCCL.cpp:588] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: INFO
[I ProcessGroupNCCL.cpp:733] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:588] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: INFO
[I ProcessGroupNCCL.cpp:733] [Rank 0] NCCL watchdog thread started!
NCCL version 2.10.3+cuda11.3
[I reducer.cpp:110] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I reducer.cpp:110] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:218] [Rank 0]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 1
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 174
output_device: 0
rank: 0
total_parameter_size_bytes: 9061520
world_size: 2
backend_name: nccl
bucket_sizes: 8005776, 1055744
cuda_visible_devices: N/A
device_ids: 0
dtypes: float
initial_bucket_size_limits: 26214400, 1048576
master_addr: localhost
master_port: 37861
module_name: CenterNet
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

[I logger.cpp:218] [Rank 1]: DDP Initialized with: 
broadcast_buffers: 1
bucket_cap_bytes: 26214400
find_unused_parameters: 0
gradient_as_bucket_view: 0
has_sync_bn: 1
is_multi_device_module: 0
iteration: 0
num_parameter_tensors: 174
output_device: 1
rank: 1
total_parameter_size_bytes: 9061520
world_size: 2
backend_name: nccl
bucket_sizes: 8005776, 1055744
cuda_visible_devices: N/A
device_ids: 1
dtypes: float
initial_bucket_size_limits: 26214400, 1048576
master_addr: localhost
master_port: 37861
module_name: CenterNet
nccl_async_error_handling: N/A
nccl_blocking_wait: N/A
nccl_debug: INFO
nccl_ib_timeout: N/A
nccl_nthreads: N/A
nccl_socket_ifname: N/A
torch_distributed_debug: DETAIL

Rank 1: Step 0: inputs: torch.Size([18, 3, 512, 512])
Epoch 0/300000
Batch 0/3257: Rank 0: Step 0: inputs: torch.Size([18, 3, 512, 512])
[I logger.cpp:382] [Rank 0 / 2] [iteration 1] Training CenterNet unused_parameter_size=0 
 Avg forward compute time: 1156088960 
 Avg backward compute time: 0 
Avg backward comm. time: 0 
 Avg backward comm/comp overlap time: 0
[I ProcessGroupNCCL.cpp:735] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:735] [Rank 1] NCCL watchdog thread terminated normally
Traceback (most recent call last):
  File "./centernet.py", line 357, in <module>
    torch.multiprocessing.spawn(
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/centernet.py", line 196, in train_ddp_model
    y_pred = model(inputs).to(torch_device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in forward
    self._sync_buffers()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1602, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1606, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1627, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[34112], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))

Stack trace for the SyncBatchNorm error

Traceback (most recent call last):
  File "./centernet.py", line 357, in <module>
    torch.multiprocessing.spawn(
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/centernet.py", line 200, in train_ddp_model
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/_functions.py", line 99, in backward
    torch.distributed.all_reduce(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[2560], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))

and the NCCL logs:

b73bedfc9f02:13:13 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
b73bedfc9f02:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f094dffe200 recvbuff 0x7f094dffe200 count 5712 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f0954ebc000 recvbuff 0x7f0954ebc000 count 9197968 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe2e1dfea00 recvbuff 0x7fe2e1dfea00 count 416 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f0955b77400 recvbuff 0x7f0955b77400 count 136448 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f094dffe600 recvbuff 0x7f094dffe600 count 416 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f094dffec00 recvbuff 0x7f094dffee00 count 260 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f094dfff000 recvbuff 0x7f094dfff400 count 260 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f094dfffe00 recvbuff 0x7f0955df0800 count 132 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df1000 recvbuff 0x7f0955df1400 count 772 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df2200 recvbuff 0x7f0955df2600 count 772 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9df3a00 recvbuff 0x7fe2e9df3c00 count 196 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df4000 recvbuff 0x7f0955df4600 count 1156 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df5800 recvbuff 0x7f0955df5e00 count 1156 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df3800 recvbuff 0x7f0955df3a00 count 196 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df1600 recvbuff 0x7f0955df7000 count 1156 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df8600 recvbuff 0x7f0955df8c00 count 1156 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df1200 recvbuff 0x7f0955df9e00 count 260 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dfac00 recvbuff 0x7f0955dfb400 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dfca00 recvbuff 0x7f0955dfd200 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dfaa00 recvbuff 0x7f0955dfe800 count 260 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dff600 recvbuff 0x7f0955b77400 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b78a00 recvbuff 0x7f0955b79200 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dff400 recvbuff 0x7f0955df2200 count 260 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b7aa00 recvbuff 0x7f0955b7b200 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df4400 recvbuff 0x7f0955b7c400 count 1540 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955dff000 recvbuff 0x7f0955b7d600 count 516 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b7ec00 recvbuff 0x7f0955b7fa00 count 3076 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bfbe00 recvbuff 0x7fe2e9bfcc00 count 3076 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b7e600 recvbuff 0x7f0955b85400 count 516 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9f9ce00 recvbuff 0x7fe2e9f9dc00 count 3076 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b89000 recvbuff 0x7f0955b89e00 count 3076 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b85c00 recvbuff 0x7f0955b86000 count 516 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9fa2a00 recvbuff 0x7fe2e9fa3800 count 3076 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9fa5e00 recvbuff 0x7fe2e9fa6c00 count 3076 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df7000 recvbuff 0x7f0955df7400 count 516 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9dfb000 recvbuff 0x7fe2e9fa9200 count 3076 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b94c00 recvbuff 0x7f0955b95a00 count 3076 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9dfa200 recvbuff 0x7fe2e9dfde00 count 772 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe28f5c0000 recvbuff 0x7fe28f5c1400 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe28f5c3a00 recvbuff 0x7fe28f5c4e00 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955df2200 recvbuff 0x7f0955df4400 count 772 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bf8a00 recvbuff 0x7fe2e9bfbe00 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9f9ce00 recvbuff 0x7fe2e9f9e200 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bffa00 recvbuff 0x7fe2e9df4800 count 772 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9fa9200 recvbuff 0x7fe2e9fa2a00 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9fa5e00 recvbuff 0x7fe28f5c0000 count 4612 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bf7400 recvbuff 0x7fe2e9bf8a00 count 1284 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9f9c800 recvbuff 0x7fe2e9ffb200 count 7684 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f0955b7ec00 recvbuff 0x7f0955b99e00 count 7684 datatype 0 op 0 root 0 comm 0x55912c3ae000 [nranks=2] stream 0x559144affda8
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bf7400 recvbuff 0x7fe2e9bfbe00 count 1284 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9fa2a00 recvbuff 0x7fe2e9ffb200 count 7684 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9ffb200 recvbuff 0x7fe2e9f9c800 count 7684 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9dff800 recvbuff 0x7fe2e9ffd200 count 1284 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9f9c800 recvbuff 0x7fe2874a0000 count 7684 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9f9d800 recvbuff 0x7fe2874a0000 count 7684 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2e9bfbe00 recvbuff 0x7fe2e9ffd200 count 2564 datatype 0 op 
b73bedfc9f02:13:13 [1] NCCL INFO AllGather: opCount 0 sendbuffb73bedfc9f02:12:12 [0] NCCL INFO AllGather: opCount 0 sendbuff 0x7fe2874a1400 recvbuff 0x7fe2874a3e00 count 10244 datatype 0 op 0 root 0 comm 0x555bf5b10000 [nranks=2] stream 0x555bdb8c1e00

Do you have a minimal repro that we could potentially take a look at? In terms of the broadcast mismatch, one thing we could try out is printing the sizes of the tensors before this line: /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py to see if the collectives are using similar sized tensors.

Do you have a minimal repro that we could potentially take a look at?

Unfortunately, no.

In terms of the broadcast mismatch, one thing we could try out is printing the sizes of the tensors before this line: /opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py to see if the collectives are using similar sized tensors.

I am not sure which line you are referring to here.

Because it kept failing in two different places in a somewhat non-deterministic manner, I figured it might be a timing issue between the two processes, so I tried re-organising the order of some of the operations before the training loop (construction of the datasets, loss function, and optimiser, and printing out the model summary) to try to keep the operations balanced in both processes. This seemed to resolve the issue or, at the very least, it has masked the problem.

Oh, my bad, that was a typo. I meant this line:

File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced

@Bidski Some additional questions here: are you running on two ranks, where one rank fails with

RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[34112], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))

and the other rank fails with:

RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[2560], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))

Just wanted to make sure these two errors were from the same training run. If not, is it possible to share the errors and stack traces from all the ranks when you run into this issue?

@pritamdamania87 The two errors are from separate training runs. The information I have already provided is all of the information that PyTorch/NCCL dumped to the console/log files.

File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced

I placed print statements both near this line (I put my prints in the function that calls this one) and in the C++ files. I wasn't able to make out much from the extra logs I printed. IIRC, on the Python side I never saw the correctly sized tensors, but I think I did on the C++ side. This is part of the reason I started to suspect a timing issue, as it looked like the tensors should have been broadcast/all_reduced across both ranks but some part of the system seemed to think they weren't.

You will notice in the NCCL logs for the broadcast error that there is one broadcast op listed with the correct size (f85219af9a4b:12:12 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb121b77400 recvbuff 0x7fb121b77400 count 136448; 136448 == 34112 * 4). However, in the NCCL logs for the all_reduce error there are no ops listed with the correct size (although there does appear to be some corruption from multiple processes writing to the logs, and there is an op with count 10244, which is 2560 * 4 + 4, so maybe that is the offending op?).

I see, is it possible to share the full log files for one of these training runs? I was interested in seeing what happens on Rank 1 when, for example, Rank 0 runs into the mismatch detection.

I see, is it possible to share the full log files for one of these training runs?

What I have already shown are the full log files; I don't have anything else. I think I only cut off the start of the NCCL logs that showed the initial setup information. The console output that I have shown is the only console output that was reported to me; there is no separate output for rank 0 and rank 1.

Hi Bidski,

Your problem could be that the order of collectives is different across ranks. In this case, collectives from DDP and SyncBatchNorm are clashing with each other.

This can happen with DDP when the backward graph is different across ranks.

There are two things that could help here.

First, could you try setting DDP's __init__ argument find_unused_parameters to True? That will lead to decreased performance, but it should help pinpoint the issue (a sketch of the change is below).

Second, to verify this, we need the logs from all ranks, as all it takes is one rank misbehaving.
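
For the first suggestion, the change is just the constructor flag from the code at the top of the thread:

    # Sketch: same DDP construction as in the original post, but with
    # find_unused_parameters=True so DDP reports parameters that did not
    # receive a gradient in the backward pass.
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[rank], output_device=rank, find_unused_parameters=True
    )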

First, could you try setting DDP's __init__ argument find_unused_parameters to True?

Did that. It reported no unused parameters.

Second, to verify this, we need the logs from all ranks, as all it takes is one rank misbehaving.

I have mentioned two or three times now that I have provided all of the logs that I have. I don’t have any separate logs for each of the ranks, only the logs that I have already provided.

I will repeat what I said a couple of days ago:

Because it kept failing in two different places in a somewhat non-deterministic manner, I figured it might be a timing issue between the two processes, so I tried re-organising the order of some of the operations before the training loop (construction of the datasets, loss function, and optimiser, and printing out the model summary) to try to keep the operations balanced in both processes. This seemed to resolve the issue or, at the very least, it has masked the problem.

There is still the possibility that part of the dataset processing will result in one rank taking slightly longer than the other to construct its next batch, but this should have no effect on the operations that are taking place in either the forward or the backward passes of the graph (unless dataloader operations are part of the graph??)

I think I am experiencing the same issue. However, I am stuck and can't go further. In my logs, I can see that all ranks have the same shape apart from rank 6. I am not sure of rank 6's shape as it isn't logged for some reason. I am also providing a screenshot of the processed logs here.

In this case, rank 6 is performing a different collective than the other ranks. It's performing a barrier while the others are performing a broadcast.

Could it be that you have an uneven dataset and rank 6 is done while the others are still going? If this is the case, you can try the join context manager (see the sketch below).
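
A rough, untested sketch of that (assuming model is the DistributedDataParallel instance and that dataloader, criterion, and optimizer come from the surrounding training code) would look like this:

    # Sketch: DDP's join() context manager shadows collectives for ranks that
    # exhaust their data early, so an uneven dataset does not leave the ranks
    # running mismatched collectives.
    with model.join():
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()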

Timing is usually not a problem, since collectives wait for all ranks to join. Timing issues usually lead to performance problems, as collectives take a lot longer while waiting for a straggler rank.

The most common issues with collectives are shape and ordering mismatch.

This is why Pritam’s suggestion of printing the input shapes could help you understand which issue you’re facing here.

Sometimes the mismatch is easy to detect, as it shows up in the batch dimension of your input and this would be a great place to start.
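
As a concrete, untested sketch of that check (assuming the process group is already initialised and inputs is the per-rank batch on that rank's GPU, as required by the NCCL backend), you could gather the batch sizes and print them from one rank:

    import torch
    import torch.distributed as dist

    # Sketch: gather every rank's local batch size so a mismatch in the batch
    # dimension shows up immediately. All ranks must call this at the same
    # point in the loop, since it is itself a collective.
    local_bs = torch.tensor([inputs.shape[0]], device=inputs.device)
    gathered = [torch.zeros_like(local_bs) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_bs)
    if dist.get_rank() == 0:
        print("per-rank batch sizes:", [int(t.item()) for t in gathered], flush=True)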

This is why Pritam’s suggestion of printing the input shapes could help you understand which issue you’re facing here.

Sometimes the mismatch is easy to detect, as it shows up in the batch dimension of your input and this would be a great place to start.

In my case the input shapes on both ranks were identical, as were the shapes for every layer in the model (all ranks were running an identical model).

I think I found the cause for my case. The details are mentioned here.