NCCL error in single-machine multi-GPU training

Error message: To avoid data inconsistency, we are taking the entire process down
During training, some samples raise errors because of the random cropping functions I use, so I skip those samples. With this skipping in place, training on a single machine with multiple GPUs fails with the error above, while training on a single machine with a single GPU works without problems. Judging by the error message, the cause seems to be a data inconsistency between GPUs: one GPU skips a sample while the others do not. How can I solve this? Is there a way to make all GPUs skip the same training step together?
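
To make the question concrete, here is a minimal sketch of what I mean by skipping together. It assumes PyTorch with torch.distributed, which is my guess based on the error text and may not match every setup. The helper names any_rank_failed and guarded_step are hypothetical and are not from my actual code. The idea is that each process reports whether its own sample failed, the flags are combined across processes, and every process then makes the same decision before any gradient synchronization starts.

```python
import torch
import torch.distributed as dist


def any_rank_failed(local_failed, device=torch.device("cpu")):
    """Return True on every rank if at least one rank reported a failure."""
    # Outside a distributed run there is nothing to synchronize.
    if not (dist.is_available() and dist.is_initialized()):
        return local_failed
    # With the NCCL backend, `device` should be this rank's CUDA device.
    flag = torch.tensor([1 if local_failed else 0], dtype=torch.int32, device=device)
    # MAX over all ranks: the result is 1 if any rank set its flag.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())


def guarded_step(make_batch, step_fn, device=torch.device("cpu")):
    """Run step_fn(batch) only if make_batch() succeeded on every rank."""
    try:
        batch = make_batch()  # e.g. the random cropping that sometimes raises
        failed = False
    except Exception:
        batch, failed = None, True
    # All ranks reach this collective, whether or not their batch failed.
    if any_rank_failed(failed, device):
        return False  # every rank skips this step together
    step_fn(batch)  # forward, backward and optimizer step would go here
    return True
```

In the training loop, make_batch would wrap the data preparation and step_fn would wrap the forward and backward pass, so a failed sample on one GPU makes all GPUs skip that iteration instead of leaving the others waiting in a collective call.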

I'm quite worried about this. Could someone please help me with this question?