How to fix a SIGSEGV in PyTorch when using distributed training (e.g. DDP)?

Hey @Brando_Miranda,

I have a very similar, if not the same, issue (it is difficult to say). Have you found a solution to this problem? In my case the issue also occurs only intermittently: on the same server (same GPUs, same environment, etc.), training my model sometimes succeeds and sometimes ends with a SIGSEGV.
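In case it helps anyone narrow this down, here is a minimal sketch of how the crash location could be captured. It does not fix anything. It only uses Python's standard `faulthandler` module to dump the Python traceback of each process when a SIGSEGV arrives. The helper name `enable_crash_traces` and the log file naming are hypothetical, and reading `RANK` assumes the job is launched with `torchrun`, which sets that environment variable.

```python
import faulthandler
import os


def enable_crash_traces(log_dir="."):
    """Enable faulthandler so a fatal signal dumps Python tracebacks to a per-rank file.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    # Assumption: RANK is set by the launcher (e.g. torchrun).
    # Fall back to "0" for single-process runs.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, f"segfault_rank{rank}.log")
    # The file has to stay open for the lifetime of the process,
    # because faulthandler writes to its file descriptor when the crash happens.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces()` once at the start of each worker, before the process group is initialised, should leave one log file per rank. Since the crash is intermittent, the trace from a failing run at least shows which Python call was active when the signal was raised.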

Cheers

Edit: If it is of any help, I posted my code here.
