Torchrun crashes when creating checkpoint, running with 2 GPUs

I’m trying to run an mmsegmentation-based script. It runs smoothly on 1 GPU. However, when I use torchrun (or torch.distributed.launch) to run it on my two GPUs, it runs well for the first 1000 epochs and then crashes at the point where it is supposed to create a checkpoint.

This is the error message that I’m getting:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-03-01_08:20:41
host : FabianDual2.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 67)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 67

/home/fabian/miniconda3/envs/mask2form/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

Any help would be greatly appreciated.

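Do you see any other error messages in the stack trace? The log only shows that rank 0 received SIGKILL (exit code -9), which means the process was killed from outside rather than failing with a Python exception. One common cause is the OS killing a process that is running out of host memory, to avoid a system crash. Could you monitor your RAM usage while training, especially around the checkpoint step, and check whether this is the case?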
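One way to watch host RAM is a small logger run in a second terminal while training. This is a minimal sketch, assuming a Linux host that exposes /proc/meminfo with a MemAvailable field. The function names are made up for illustration and are not part of PyTorch or mmsegmentation.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (values are kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Convert the MemAvailable field (reported in kB) to GiB."""
    return info["MemAvailable"] / (1024 ** 2)


def log_memory(interval_s=5.0, samples=None, path="/proc/meminfo"):
    """Print available host RAM every interval_s seconds.

    samples=None loops until interrupted (Ctrl+C); path is assumed to be
    the Linux meminfo file.
    """
    count = 0
    while samples is None or count < samples:
        with open(path) as f:
            info = parse_meminfo(f.read())
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} available host RAM: {available_gib(info):.2f} GiB", flush=True)
        time.sleep(interval_s)
        count += 1
```

Calling log_memory() starts the logger. If the available RAM drops toward zero shortly before the SIGKILL, the process was most likely killed for running out of host memory.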