Did you check whether all ranks process the same number of batches? If they don't, the rank that finishes first stops taking part in the collective communication, so the other ranks keep waiting for it and eventually hit the timeout.
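As a quick sanity check, something like the sketch below could compare the per-rank batch counts. It is plain Python and only does the arithmetic. The helper names and the shard sizes are made up for illustration, and it assumes you already know how many samples each rank sees. In a real job you could instead print `len(dataloader)` on every rank, or collect the values with `torch.distributed.all_gather_object`.

```python
import math


def batches_per_rank(samples_per_rank, batch_size, drop_last=False):
    """Return how many batches each rank would iterate over."""
    if drop_last:
        # Incomplete trailing batches are discarded.
        return [n // batch_size for n in samples_per_rank]
    # Incomplete trailing batches still count as a batch.
    return [math.ceil(n / batch_size) for n in samples_per_rank]


def check_equal_batches(samples_per_rank, batch_size, drop_last=False):
    """Return (all_equal, counts) for the given per-rank sample counts."""
    counts = batches_per_rank(samples_per_rank, batch_size, drop_last)
    return len(set(counts)) == 1, counts


if __name__ == "__main__":
    # Hypothetical shard sizes for 4 ranks; the last rank has fewer samples.
    ok, counts = check_equal_batches([1000, 1000, 1000, 900], batch_size=32)
    print(ok, counts)  # False [32, 32, 32, 29]
```

If the counts differ, the ranks with fewer batches would leave the training loop early while the others are still waiting in a collective call.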