Torchrun DistributedDataParallel training exitcode -7

Following this excellent tutorial (Multinode Training — PyTorch Tutorials 2.0.1+cu117 documentation), I attempted to train a model across several devices. The ML pipeline itself seems to work, since training runs for smaller models (with less data) complete correctly. However, the biggest model I am training requires a lot of training data and is therefore very resource-intensive. Still, from monitoring resources during training I can rule out that a lack of resources caused the following error I received:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 280966) of binary...

Unfortunately, I was unable to pin down what exactly is causing this issue, since I didn't find any comprehensive docs on this exit code. What I have already tried:

  • set num_workers=0 in the DataLoader
  • decrease the batch size
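Concretely, the two changes above amount to the following DataLoader settings (a minimal sketch with a placeholder dataset; batch_size=2 is an illustrative reduced value, not my actual configuration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset standing in for the real training data
dataset = TensorDataset(torch.zeros(8, 4))

loader = DataLoader(
    dataset,
    batch_size=2,   # reduced batch size (illustrative value)
    num_workers=0,  # load batches in the main process, no worker subprocesses
)

for (batch,) in loader:
    print(batch.shape)  # 4 batches of shape [2, 4]
```

Neither change made a difference; the error still occurs with the large model.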

The training script can be found here.
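For reference, the distributed setup in the script follows the tutorial's pattern, roughly like this (a simplified sketch that runs as a single CPU process with the gloo backend; the real run is launched with torchrun across nodes, which sets the rank/world-size environment variables itself, and uses nccl on GPUs):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun normally provides these; set defaults so the sketch runs standalone
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="gloo")  # the multinode run would use nccl

# wrap a placeholder model in DDP; on GPU you would pass device_ids=[local_rank]
model = DDP(torch.nn.Linear(4, 2))

# one dummy forward/backward step; DDP syncs gradients across ranks here
loss = model(torch.randn(2, 4)).sum()
loss.backward()

dist.destroy_process_group()
```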

Any leads on this would be appreciated!