"Detected mismatch between collectives on ranks" error as soon as I reach the 2nd epoch

I am trying to train a model on a single machine with multiple GPUs using the Docker image nvcr.io/nvidia/pytorch:22.12-py3, but I am getting RuntimeError: Detected mismatch between collectives on ranks.

Some info:

  • The error only happens after I finish the first epoch and move on to the second.
  • I am using dist.barrier() to synchronize the processes before moving to the 2nd epoch, so that rank 0 can compute metrics on the validation set.
  • I am setting find_unused_parameters=True, but it makes no difference. The problem seems to lie elsewhere, given that the error only appears after the 1st epoch finishes.

Pseudocode below

import torch.distributed as dist

def train_flow(rank, num_epochs):
    # Setup of ddp_model, train_loader, sampler, optimizer, criterion and
    # validation_data is omitted here.

    for epoch in range(num_epochs):
        # Training step
        sampler.set_epoch(epoch)
        train_epoch_loss, train_epoch_acc, train_epoch_pr_auc = train(ddp_model, train_loader, optimizer, criterion, rank)

        print(f"[INFO] - rank {rank} - Training loss: {train_epoch_loss:.4f}, "
              f"Training acc: {train_epoch_acc:.4f}, "
              f"Training pr_auc: {train_epoch_pr_auc:.4f}")

        print('-' * 50)

        # Synchronize before starting validation on rank 0 to avoid runtime errors
        dist.barrier()

        # Validation step (rank 0 only)
        if rank == 0:
            validation_result = validate(ddp_model, validation_data)

        # Synchronize all ranks after validation before moving to the next epoch
        dist.barrier()

    if rank == 0:
        save_metrics_validation()

    # Clean up the distributed backend
    dist.destroy_process_group()

In theory this should work. Do you know what I am missing?

Error for 4 ranks:

Process Process-1:4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/go/src/model.py", line 311, in train_flow
    dist.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 2 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).
Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/go/src/model.py", line 311, in train_flow
    dist.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).
[INFO] - rank 2 - Training loss: 0.0073, Training acc: 0.9989, Training pr_auc: 0.8456
--------------------------------------------------
[INFO] - rank 3 - Training loss: 0.0078, Training acc: 0.9989, Training pr_auc: 0.8453
--------------------------------------------------
[INFO] - rank 1 - Training loss: 0.0078, Training acc: 0.9989, Training pr_auc: 0.8448
--------------------------------------------------
Process Process-1:3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/go/src/model.py", line 311, in train_flow
    dist.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).

I faced the same issue. In my case it was a bug where each GPU process had its own dataset, and those datasets had different lengths.
Using the same dataset in every GPU process fixed my issue.
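
One way to catch this early might be to compare how many batches each rank is about to run before training starts. The sketch below is only an illustration, not code from either setup in this thread: check_equal_batch_counts and gather_batch_counts are hypothetical helper names, and it assumes the process group is already initialized and that train_loader supports len().

```python
def check_equal_batch_counts(counts):
    """Raise if the per-rank batch counts differ.

    `counts` is a list indexed by rank, e.g. [120, 120, 118, 120].
    Returns the common count when all ranks agree.
    """
    if len(set(counts)) > 1:
        detail = ", ".join(f"rank {r}: {n}" for r, n in enumerate(counts))
        raise RuntimeError(f"Ranks disagree on batches per epoch ({detail})")
    return counts[0]


def gather_batch_counts(train_loader):
    """Collect len(train_loader) from every rank. Must be called on all ranks."""
    # Imported lazily so the pure-Python check above works without torch.
    import torch.distributed as dist

    counts = [None] * dist.get_world_size()
    # all_gather_object is itself a collective: every rank has to call it.
    dist.all_gather_object(counts, len(train_loader))
    return counts
```

Calling check_equal_batch_counts(gather_batch_counts(train_loader)) on every rank right after building the loader would fail fast and print the per-rank counts. With the NCCL backend, all_gather_object expects each rank to have selected its GPU with torch.cuda.set_device beforehand.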

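Presumably the different lengths throw off the synchronization: a rank with fewer batches reaches torch.distributed.barrier while the other ranks are still issuing other collectives.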
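
To make that guess concrete, here is a toy model in plain Python. It is a simplification under stated assumptions, not how DDP is actually implemented: each training batch is modeled as a single ALLREDUCE (the gradient sync) and the end of each epoch as a single BARRIER. The names ops_for_rank and first_mismatch and the batch counts are made up for illustration.

```python
def ops_for_rank(num_batches, num_epochs):
    """Sequence of collectives one rank would issue in this toy model."""
    ops = []
    for _ in range(num_epochs):
        ops.extend(["ALLREDUCE"] * num_batches)  # one gradient sync per batch
        ops.append("BARRIER")                    # end-of-epoch barrier
    return ops


def first_mismatch(batches_per_rank, num_epochs=1):
    """Return (step, ops_at_step) for the first step where ranks disagree, else None."""
    sequences = [ops_for_rank(n, num_epochs) for n in batches_per_rank]
    for step in range(max(len(s) for s in sequences)):
        # A rank that has run out of operations is marked as DONE.
        ops = [s[step] if step < len(s) else "DONE" for s in sequences]
        if len(set(ops)) > 1:
            return step, ops
    return None


# Assumed numbers: rank 0 has one more batch than ranks 1 to 3.
print(first_mismatch([4, 3, 3, 3]))
# -> (3, ['ALLREDUCE', 'BARRIER', 'BARRIER', 'BARRIER'])
```

In this toy run, ranks 1 to 3 are already at the barrier while rank 0 is still in a different collective, which has the same shape as the traceback above. With equal batch counts on every rank, first_mismatch returns None.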