Terminate called after throwing an instance of 'c10::Error' what(): CUDA error: uncorrectable NVLink error detected during the execution

I hit this error while running faster_rcnn and hope someone can help.
It happened when I had only run epoch 0.

Could you rerun the script with:

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python script.py args

and post the logs here, please?

Sorry, I can't run these scripts because I'm doing the computation on a high-performance computing platform.

Can I chat with you privately?

script.py and args are only placeholders and you should replace them with your actual script file name and its arguments.
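
If the platform's job launcher makes it awkward to prefix the command with environment variables, one possible alternative is to set the same two variables at the very top of the training script, before the distributed process group is created. This is only a minimal sketch: the helper name enable_distributed_debug is hypothetical, and it assumes nothing else in the job overrides these variables later.

```python
import os

def enable_distributed_debug(env=os.environ):
    """Set the two debug variables unless the job script already set them.

    Hypothetical helper, not part of PyTorch or NCCL.
    """
    # setdefault keeps any value that was already exported by the launcher.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    return {k: env[k] for k in ("NCCL_DEBUG", "TORCH_DISTRIBUTED_DEBUG")}

# Call this first in the script, ahead of any distributed setup,
# so the settings are in place when the process group is created.
print(enable_distributed_debug())
```

The extra log output should then show up in the job's stdout/stderr files, which can be posted here.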

Thanks, I've solved it.

How did you resolve the issue and what was the root cause?