Torchrun crashes when creating checkpoint, running with 2 GPUs

fabiansc · March 1, 2023, 1:35pm

I’m trying to run an mmsegmentation-based script. It runs smoothly on 1 GPU, but when I use torchrun (or torch.distributed.launch) to run it on my two GPUs, it runs well for the first 1000 epochs and then crashes when it’s supposed to create a checkpoint.

This is the error message that I’m getting:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-03-01_08:20:41
host : FabianDual2.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 67)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 67

/home/fabian/miniconda3/envs/mask2form/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

Any help would be greatly appreciated.

ptrblck · March 2, 2023, 9:41am

Do you see any other error messages in the stacktrace? A SIGKILL like this often means the OS killed the process because it was running out of host memory, which it does to avoid a system crash. Could you monitor your RAM while the checkpoint is being created and check whether this is the case?
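
One way to check this is to log the process's peak host memory right before and after the checkpoint is written. The snippet below is only a minimal sketch using the standard library: the helper names `peak_rss_mib` and `log_memory` are made up for illustration, and where you call them depends on how your training script hooks into checkpointing.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size (host RAM) in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> float:
    """Print the peak host memory with a label (hypothetical helper)."""
    mib = peak_rss_mib()
    # flush=True so the line is written out even if the process is killed soon after.
    print(f"[{tag}] peak host RSS: {mib:.1f} MiB", flush=True)
    return mib


if __name__ == "__main__":
    # Assumed usage: call these around the code that saves the checkpoint.
    log_memory("before checkpoint")
    log_memory("after checkpoint")
```

If the logged value on rank 0 climbs toward your total RAM just before the crash, that points to host memory as the cause. The `resource` module is only available on Unix-like systems.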
