DDP processes go into D status (disk sleep (uninterruptible))

Hi, when using DDP training on ImageNet, the processes slow down, sometimes at the beginning of training and sometimes after several iterations. When I use htop to look at the processes, most of them are in the D state (disk sleep, uninterruptible). Training does not stop, but it becomes very, very slow.

The same code ran fine several weeks ago, with each iteration taking only 50 seconds (when the processes are in the R state, each iteration still takes 50 seconds). Now this strange behavior happens and each iteration can take up to 15 minutes (with the processes in the D state).

My env:
PyTorch 1.8.1+cu111
Python 3.8.12
RTX 3090

My launch script:
ARCH='cls_maskv1c100seed9999r7p8'
CUDA_VISIBLE_DEVICES=0,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 train_imagenet_dist.py \
    --data './data/imagenet' \
    --batch_size 1024 \
    --learning_rate 0.5 \
    --auxiliary \
    --save 'imagenet_'${ARCH}'_trainseed0' \
    --arch ${ARCH} \
    --workers 96

Interestingly, when I move the entire ImageNet folder to an SSD, not all of the processes go into D (1 or 4 still do), but the iteration speed is back to the normal 50 seconds. It seems something is wrong with the HDD I/O.

This is the same workaround as in the thread "Strange behavior in Pytorch".

But I don't think this is a good solution, because ImageNet is about 140 GB while my SSD is only 2 TB. Also, this server is new (I bought it 4 months ago), so I don't think anything is wrong with the HDD itself.


Thanks for the question, albertipot,

A few things that come to mind:

  • Did you check that your I/O throughput is close to the nominal throughput you'd expect from your HDD?
  • What is the required I/O throughput to feed your GPUs? You could look at how many images per second you can train when feeding random inputs.
  • Another thing you could do is remove / comment out all the training code, keep only the data loading, and see if the same problem happens. This would narrow the issue down to either data loading or something else (a minimal loader-only sketch follows this list).
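For the last point, a loader-only benchmark could look something like the sketch below (not from this thread). It assumes the standard torchvision ImageFolder layout; the dataset path, crop size, batch size, worker count, and number of timed steps are placeholders:

```python
# Minimal sketch: time the DataLoader alone, with no model or GPU work,
# so the measured rate reflects only disk I/O + decoding + augmentation.
# The path, batch size, worker count, and step count are placeholders.
import time

import torch
from torchvision import datasets, transforms


def main():
    transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder('./data/imagenet/train', transform=transform)
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=1024, num_workers=24, shuffle=True, pin_memory=True)

    start = time.time()
    images_seen = 0
    for step, (images, _targets) in enumerate(loader):
        images_seen += images.size(0)
        if step == 20:  # time a handful of batches and stop
            break
    elapsed = time.time() - start
    print(f'{images_seen / elapsed:.1f} images/s from the DataLoader alone')


if __name__ == '__main__':
    main()
```

If this rate is far below what the GPUs can consume, the problem is on the data-loading / disk side rather than in the DDP training step.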

Hi, Alisson,

Thanks for the reply.

  1. When the issue occurred, all the data was stored on the HDD and all processes went into D; iostat reported 13596 kB_read/s, 482 kB_wrtn/s, and 328 tps. Now all data is stored on the SSD, and iostat reports 111 tps, 3895 kB_read/s, and 8.6 kB_wrtn/s. Training with other code (without DDP) and with data stored on the HDD is normal.

  2. The batch_size is 1024 and I set num_workers to 96, which is 24 workers per GPU. I think the I/O throughput should be enough for these 3090 GPUs? (The random-input sketch after this list is one way to check the GPU-only rate.)

  3. I will try this once the current training is finished. I think this is a good way to find the problem.
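As a rough way to check point 2, here is a sketch (not from this thread) that times training steps on random tensors, so no disk I/O is involved. The ResNet-50 model is only a stand-in for the actual architecture, and the batch size and step count are placeholders:

```python
# Sketch: measure per-GPU training throughput with random inputs (no disk I/O).
# ResNet-50 is a stand-in for the real architecture; batch size and the number
# of timed steps are placeholders.
import time

import torch
import torchvision.models as models

device = torch.device('cuda')
model = models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

batch_size = 256  # per-GPU batch from the thread (1024 / 4); lower it if it does not fit
images = torch.randn(batch_size, 3, 224, 224, device=device)
targets = torch.randint(0, 1000, (batch_size,), device=device)

torch.cuda.synchronize()
start = time.time()
steps = 20
for _ in range(steps):
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
elapsed = time.time() - start
print(f'{steps * batch_size / elapsed:.1f} images/s per GPU with random inputs')
```

Multiplying this per-GPU rate by 4 gives roughly the image rate the data pipeline has to sustain; comparing it with the loader-only rate above shows which side is the bottleneck.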

Many thanks!