DDP training is fast initially, then slows to a crawl but does not crash


I’m training with DDP, launched via the torch.distributed.launch command.
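In case it matters, the invocation looks roughly like this (the script name and its flags are placeholders, not my exact command):

```shell
# Single-node launch, one process per GPU; train.py is a placeholder
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
```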

Training is normal for about 700 batches; then it consistently slows down at that point and never recovers (training continues, just very slowly).

Memory usage remains constant after the slowdown. htop shows that CPU usage drops sharply: at the beginning all 40 cores are highly utilized, but after the slowdown only 2 or 3 processes stay near 100% utilization while the rest fall to around 1-5%.

nvidia-smi shows GPU utilization pinned at 100% on a subset of GPUs and 0% on the others; this happens whether I train with 2 or 4 GPUs.

It looks to me like some kind of synchronization problem (a few ranks racing ahead and the rest waiting), but I’m not sure how to confirm that, or how to address it if so.
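To at least confirm the straggler theory, I'm considering instrumenting the training loop with per-rank step timing, roughly like this (a sketch; `train_step` and `loader` are placeholders for my actual code):

```python
import time

class StepTimer:
    """Accumulates per-step wall time so a straggling rank shows up in the logs."""

    def __init__(self, rank, log_every=50):
        self.rank = rank
        self.log_every = log_every
        self.total = 0.0
        self.count = 0

    def add(self, seconds):
        """Record one step's duration; print the running mean every log_every steps."""
        self.total += seconds
        self.count += 1
        if self.count % self.log_every == 0:
            print(f"rank {self.rank}: mean step time {self.mean():.3f}s")
            self.total, self.count = 0.0, 0

    def mean(self):
        return self.total / max(self.count, 1)

# In the training loop (sketch):
# timer = StepTimer(rank=torch.distributed.get_rank(), log_every=50)
# for batch in loader:
#     torch.cuda.synchronize()             # include queued GPU work in the timing
#     start = time.perf_counter()
#     loss = train_step(batch)             # placeholder for the real step
#     torch.cuda.synchronize()
#     timer.add(time.perf_counter() - start)
```

If the per-rank means diverge around batch 700, that would confirm a straggler; running `py-spy dump --pid <worker pid>` on the slow workers at that point should then show where they are blocked.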

I’ve tried different settings for OMP_NUM_THREADS without any change.

I’ve also tried KMP_AFFINITY=granularity=fine,compact,1,0
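For reference, this is roughly how I set those before launching (the thread count is just one of the values I tried, derived from 40 cores / 4 processes):

```shell
# Exported in the shell that launches training, so every worker inherits them
export OMP_NUM_THREADS=10
export KMP_AFFINITY=granularity=fine,compact,1,0
```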

CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
GPUs: 4 x Tesla V100 23GB
CentOS: centos-release-7-9.2009.1.el7.centos.x86_64
PyTorch: 1.12.1
CUDA: 10.1

I’m using the Compact Transformers training script from: