I’m trying to run an mmsegmentation-based script. It runs smoothly on a single GPU, but when I launch it on my two GPUs with torchrun (or torch.distributed.launch), it trains fine for the first 1000 epochs and then crashes at the point where it is supposed to save a checkpoint.
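For reference, this is roughly how I launch it in both cases (the config path is a placeholder; other flags omitted):

```shell
# Single-GPU run (works fine):
python train.py configs/my_config.py

# Two-GPU run via torchrun (crashes at the checkpoint step):
torchrun --nproc_per_node=2 train.py configs/my_config.py --launcher pytorch
```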
This is the error message that I’m getting:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-03-01_08:20:41
host : FabianDual2.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 67)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 67
/home/fabian/miniconda3/envs/mask2form/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Any help would be greatly appreciated.