Following this excellent tutorial (Multinode Training — PyTorch Tutorials 2.0.1+cu117 documentation), I attempted to train a model on several devices. The ML pipeline itself appears to work, since training runs for smaller models (with less data) complete correctly. However, the largest model I am training requires a lot of training data and is therefore very resource intensive. Still, based on monitoring resource usage during training, I can rule out a lack of resources as the cause of the following error:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 280966) of binary...
Unfortunately, I have not been able to pin down what exactly is causing this issue, since I could not find any comprehensive documentation on the error. What I have already tried (wired up roughly as in the sketch after this list):
- set `num_workers=0` in the DataLoader
- decrease the batch size
- limit `OMP_NUM_THREADS`
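
For context, this is roughly how those three settings are applied on my side. The dataset and the concrete values below are just stand-ins for illustration, not the actual data pipeline or configuration:

```python
import os

# Limit intra-op threading; set before importing torch so the thread pool respects it.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset (random tensors) standing in for the real training data.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=8,    # reduced batch size (placeholder value)
    num_workers=0,   # no worker subprocesses for data loading
    pin_memory=False,
)

for x, y in loader:
    pass  # training step would go here
```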
The training script can be found here.
Any leads on this would be appreciated!