Hi, when I train on ImageNet with DDP, the process slows down, sometimes right at the start of training and sometimes after several iterations. When I look at the processes with htop, most of them go into the D state (uninterruptible disk sleep). Training does not stop, but it becomes very, very slow.
The same code ran fine several weeks ago, with each iteration taking about 50 seconds. Now this strange behavior happens: an iteration can take up to 15 minutes while the processes are in D (when they are in R, an iteration still takes 50 seconds).
My launch script:
CUDA_VISIBLE_DEVICES=0,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 train_imagenet_dist.py
Interestingly, when I move the entire ImageNet folder to the SSD, some processes still go into D (1 or 4 of them), but the iteration speed is back to the normal 50 seconds. It seems that something is wrong with HDD I/O.
This is the same workaround as in the thread Strange behavior in Pytorch.
But I don’t think this is a good solution, because ImageNet is 140 GB while my SSD is only 2 TB. Also, this server is new; I bought it 4 months ago, so I don’t think anything is wrong with the HDD itself.
Thanks for the question, albertipot,
A few things that come to mind:
- did you check that your IO throughput is close to the nominal throughput you’d expect from your HDD?
- what is the required IO throughput to feed your GPUs? You could look at how many images you can train per second when feeding random inputs.
- another thing you could do is remove / comment out all the training code, keep only the data loading, and see whether the same problem happens. That would narrow the issue down to either data loading or something else.
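For the first bullet, a rough sequential-read check can be done in plain Python. This is a minimal sketch; the demo uses a temporary file, and in practice you would point it at a large file on the HDD partition:

```python
import os
import time


def measure_read_throughput(path, block_size=8 * 1024 * 1024):
    """Read `path` sequentially and report throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Demo on a temporary 16 MB file. In practice, point this at a large
    # file on the HDD and drop the page cache first
    # (sync; echo 3 > /proc/sys/vm/drop_caches), otherwise you measure
    # cached reads rather than disk speed.
    import tempfile
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(16 * 1024 * 1024))
        tmp = f.name
    print(f"{measure_read_throughput(tmp):.1f} MB/s")
    os.unlink(tmp)
```

Compare the number you get against the drive's nominal sequential throughput; a healthy HDD should manage well over 100 MB/s sequentially, while heavy random access can be an order of magnitude slower.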
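For the second bullet, you can estimate how many images per second the GPUs can consume by training on random tensors, which takes the disk out of the loop entirely. A minimal sketch; the model, batch size, and image shape below are placeholders, so swap in your actual network and `device="cuda"`:

```python
import time

import torch
import torch.nn as nn


def images_per_second(model, batch_size=32, shape=(3, 224, 224),
                      iters=10, device="cpu"):
    """Train on random inputs/labels and report images/s (no disk I/O)."""
    model = model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    # Random batch generated once up front: no dataset, no disk reads.
    x = torch.randn(batch_size, *shape, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    start = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return batch_size * iters / (time.perf_counter() - start)


if __name__ == "__main__":
    # Tiny placeholder model for illustration only.
    net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
    print(f"{images_per_second(net):.1f} images/s")
```

If the synthetic-input rate is far above your real throughput, the GPUs are starved by data loading rather than compute.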
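And for the last bullet, a sketch of timing the DataLoader alone, with no forward/backward pass. The synthetic `TensorDataset` here is a stand-in; in practice you would pass your ImageFolder dataset with the same `batch_size` and `num_workers` you use in training:

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_dataloader(dataset, batch_size=32, num_workers=0, max_batches=50):
    """Iterate the dataset without any training and report samples/s."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=True)
    n = 0
    start = time.perf_counter()
    for i, (inputs, targets) in enumerate(loader):
        n += inputs.size(0)
        if i + 1 >= max_batches:
            break
    return n / (time.perf_counter() - start)


if __name__ == "__main__":
    # Synthetic stand-in; in practice pass the real dataset, e.g.
    #   torchvision.datasets.ImageFolder("/path/to/imagenet/train",
    #                                    transform=...)
    ds = TensorDataset(torch.randn(1024, 3, 32, 32),
                       torch.randint(0, 10, (1024,)))
    print(f"{time_dataloader(ds):.1f} samples/s")
```

If this loop alone reproduces the slowdown (and the processes again drop into the D state), the problem is data loading / disk I/O rather than anything in the training step.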