During single-GPU training, the code runs successfully and trains for multiple epochs (more than 4; I did not test further) without any errors. However, when training on multiple GPUs, an error occurs, and it always happens just as the first epoch is about to complete. The specific error is as follows:
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[29836], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE_COALESCED).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2598132) of binary:
The error occurs at a torch.distributed.all_reduce() (or torch.distributed.barrier()) call; a simplified sketch of that part of the code follows the second error below. When I comment out this part of the code, training completes 1 epoch, but an error occurs when training for 2 or more epochs, like this:
RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank. Original exception: [/opt/conda/condabld/pytorch_1670525541702/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [192.168.1.104]:37535
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 930361) of binary:
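For context, the end-of-epoch synchronization where the first error is raised looks roughly like the sketch below (simplified; the function and variable names are placeholders, not my exact code):

```python
import torch
import torch.distributed as dist

# Simplified sketch of my end-of-epoch synchronization (placeholder names).
# `epoch_loss` stands for a per-rank metric tensor living on the GPU.
def sync_epoch_metrics(epoch_loss: torch.Tensor) -> torch.Tensor:
    # Wait for every rank to finish its training loop for this epoch.
    dist.barrier()
    # Sum the per-rank metric across all processes; this is roughly where
    # the "mismatch between collectives" error is reported in my run.
    dist.all_reduce(epoch_loss, op=dist.ReduceOp.SUM)
    epoch_loss /= dist.get_world_size()
    return epoch_loss
```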
The environment is torch 2.1.1+cu121. If you need more details, I can provide them at any time. Thank you very much for helping me resolve the issue.