I was training a YOLO model with DDP and got a DDP error with the messages below:
[1]:
time : 2024-04-17_15:48:22
host : mccx6
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 3882833)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3882833
[2]:
time : 2024-04-17_15:48:22
host : mccx6
rank : 6 (local_rank: 6)
exitcode : -6 (pid: 3882835)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3882835
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-17_15:48:22
host : mccx6
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3882831)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3882831
=======================================================
It usually happened at the end of an epoch, and I don’t know what exit code -6 means.
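Going by the traceback lines in the log ("Signal 6 (SIGABRT) received"), the negative exit code appears to be the negated number of the signal that killed the worker, which is the same convention Python's subprocess module uses for return codes. Here is a minimal sketch for decoding such a value. The helper name `describe_exitcode` is made up for illustration and is not part of PyTorch, and the mapping of 6 to SIGABRT assumes Linux signal numbering.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Hypothetical helper: turn a worker exit code into a readable label."""
    if exitcode < 0:
        # A negative value means the process was terminated by a signal;
        # the signal number is the absolute value of the exit code.
        return f"killed by {signal.Signals(-exitcode).name}"
    # Zero or positive values are ordinary process exit statuses.
    return f"exited with status {exitcode}"


# On Linux, SIGABRT is signal 6, so this decodes the value from the log.
print(describe_exitcode(-6))
```

SIGABRT is what a process receives when something calls abort(), usually native code (C++/CUDA libraries) hitting a fatal check, so this only tells me how the workers died, not why.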