Strange very slow training on the server

Hi all,

I came across a strange problem when training my model on the server. For the first several iterations, the running time is fine: all the CPUs are busy with IO and the GPU is working. But after several iterations, the CPUs suddenly stop doing IO. I do not know what happened.

Could someone help me? Thank you very much!

During the first several iterations:

Suddenly, CPUs are not working:

And it continues like this. Very slow.


How is the GPU usage?
At the beginning, the dataloader preloads many samples, which might explain the high CPU usage.
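That preloading behavior can be sketched with a toy plain-Python loader (the `PrefetchLoader` name, the background thread, and the buffer size of 4 are illustrative stand-ins, not PyTorch's actual implementation): a worker fills a bounded buffer ahead of the consumer, so the first iterations show heavy CPU/IO activity before things settle.

```python
import queue
import threading
import time

def read_sample(i):
    """Stand-in for a disk read; sleeps briefly to mimic IO latency."""
    time.sleep(0.001)
    return i * 2

class PrefetchLoader:
    """Toy loader: a background worker fills a bounded queue ahead of
    the consumer, mimicking how a dataloader prefetches samples."""
    def __init__(self, n_samples, prefetch=4):
        self.n = n_samples
        self.q = queue.Queue(maxsize=prefetch)
        self.worker = threading.Thread(target=self._fill, daemon=True)
        self.worker.start()

    def _fill(self):
        for i in range(self.n):
            self.q.put(read_sample(i))   # blocks once the buffer is full
        self.q.put(None)                 # sentinel: no more samples

    def __iter__(self):
        while True:
            item = self.q.get()
            if item is None:
                return
            yield item

loader = PrefetchLoader(n_samples=8)
print(list(loader))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Once the buffer is full, the worker blocks until the consumer catches up, which is why the initial burst of CPU activity tapers off.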

Hi, the GPU utilization is always zero. I think my dataset is too large, so I put it on a mounted storage machine. Maybe that is the cause of the IO problem?
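One way to check whether the mounted storage is the bottleneck is to time raw sequential reads of a few dataset files outside the training loop, and compare against the same files on local disk. A stdlib sketch (the path in the usage comment and the 1 MiB chunk size are placeholders):

```python
import time

def read_throughput_mb_s(path, chunk_size=1 << 20):
    """Sequentially read a file in chunks and report MB/s.
    Note: a second run may be faster due to the OS page cache."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / max(elapsed, 1e-9) / 1e6

# Example: point this at one of your training files
# print(f"{read_throughput_mb_s('/data/train/shard.bin'):.1f} MB/s")
```

If the mounted storage reads an order of magnitude slower than local disk, that alone would explain workers stalling and the GPU sitting at zero.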

Might be unrelated, but how many workers are you using? Also, do you set pin_memory=True?

It's mainly caused by an IO problem. It seems you are using a supercomputer, but the swap area is too small. When loading data from disk, it consumes too much memory for data buffering.

Here are some suggestions:

  1. Map each input feature to its output label one by one and write them to disk contiguously.
  2. Clear the swap area.
  3. Use another machine if possible.
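Suggestion 1 above — storing each (feature, label) pair contiguously so reads are sequential — can be sketched with fixed-size binary records via the stdlib `struct` module (the record layout of four float32 features plus one int32 label is just an example, not a prescribed format):

```python
import struct

# One record: 4 little-endian float32 features + 1 int32 label
RECORD = struct.Struct("<4fi")

def write_records(path, samples):
    """Write (features, label) pairs back-to-back so reads are sequential."""
    with open(path, "wb") as f:
        for features, label in samples:
            f.write(RECORD.pack(*features, label))

def read_records(path):
    """Yield (features, label) pairs in write order from one linear scan."""
    with open(path, "rb") as f:
        data = f.read()
    for offset in range(0, len(data), RECORD.size):
        *features, label = RECORD.unpack_from(data, offset)
        yield features, label
```

With fixed-size records, sample `i` also lives at byte offset `i * RECORD.size`, so random access stays cheap without scattering small files across the filesystem.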

Right now I use 8 workers, and yes, I set pin_memory=True. After I moved the data back to the local disk, the speed returned to normal.