I am trying to train a model on a single machine with multiple GPUs using the Docker image nvcr.io/nvidia/pytorch:22.12-py3, but I am getting the error: RuntimeError: Detected mismatch between collectives on ranks.
Some info:
- The error only happens after I leave the first epoch and move to the second.
- I am using a barrier to synchronize processes before moving to the 2nd epoch, so that rank 0 can generate metrics on the validation set.
- I am setting find_unused_parameters=True, but this doesn't make a difference; the problem seems to lie elsewhere, given that the error only appears after the 1st epoch finishes. A minimal sketch of how the model is wrapped is shown right after this list.
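For reference, the DDP setup looks roughly like this (a minimal sketch from memory; the helper name setup_ddp and the init_process_group arguments are illustrative, not my exact code):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank, world_size, model):
    # Initialize the NCCL process group (MASTER_ADDR/MASTER_PORT come from env vars)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = model.to(rank)
    # find_unused_parameters=True as mentioned above; it did not change the behavior
    ddp_model = DDP(model, device_ids=[rank], find_unused_parameters=True)
    return ddp_model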
Pseudocode below:

import torch.distributed as dist

def train_flow(rank, num_epochs, ddp_model, train_loader, sampler,
               optimizer, criterion, validation_data):
    # Training loop
    for epoch in range(num_epochs):
        # Training step
        sampler.set_epoch(epoch)
        train_epoch_loss, train_epoch_acc, train_epoch_pr_auc = train(
            ddp_model, train_loader, optimizer, criterion, rank
        )
        print(f"[INFO] - rank {rank} - Training loss: {train_epoch_loss:.4f}, "
              f"Training acc: {train_epoch_acc:.4f}, "
              f"Training pr_auc: {train_epoch_pr_auc:.4f}")
        print('-' * 50)

        # Synchronize before starting validation on rank 0 to avoid runtime errors
        dist.barrier()

        # Validation step: performed on rank 0 only
        if rank == 0:
            validation_result = validate(ddp_model, validation_data)

        # Synchronize all ranks after validation before moving to the next epoch
        dist.barrier()

        if rank == 0:
            save_metrics_validation()

    # Cleanup the distributed backend
    dist.destroy_process_group()
In theory this should work; do you know what I am missing?
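For completeness, the sampler and train_loader referenced in the pseudocode are created roughly like this (a minimal sketch; train_dataset, batch_size, world_size, and rank are placeholders for my actual values):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# One DistributedSampler per rank, reshuffled each epoch via sampler.set_epoch(epoch)
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)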
Error for 4 ranks:
Process Process-1:4:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/go/src/model.py", line 311, in train_flow
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 2 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).
Process Process-1:5:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/go/src/model.py", line 311, in train_flow
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).
[INFO] - rank 2 - Training loss: 0.0073, Training acc: 0.9989, Training pr_auc: 0.8456
--------------------------------------------------
[INFO] - rank 3 - Training loss: 0.0078, Training acc: 0.9989, Training pr_auc: 0.8453
--------------------------------------------------
[INFO] - rank 1 - Training loss: 0.0078, Training acc: 0.9989, Training pr_auc: 0.8448
--------------------------------------------------
Process Process-1:3:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/go/src/model.py", line 311, in train_flow
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3193, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=BARRIER), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLGATHER_COALESCED, TensorShape=[], TensorDtypes=, TensorDeviceTypes=).