Strange very slow training on the server

karlTUM · August 14, 2019, 12:36pm

Hi all,

I ran into a strange problem when training my model on the server. For the first several iterations, the running time is fine: all the CPUs are busy with IO and the GPU is working. But after several iterations, the CPUs suddenly stop doing IO. I do not know what happened.

Could someone help me? Thank you very much!

During the first several iterations:

Suddenly, the CPUs stop working:

And it continues like this, very slowly.

albanD · August 15, 2019, 9:34am

Hi,

How is the GPU usage?
At the beginning, the DataLoader preloads many samples, so that might explain the high CPU usage.
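
To check whether the loop is stuck waiting for data rather than computing, one option is to time the two parts separately. The sketch below is only an illustration: `profile_loader`, `slow_batches`, and the no-op step are hypothetical names, and the generator with `time.sleep` merely stands in for slow disk reads. Any iterable of batches (for example a DataLoader) could be passed in instead.

```python
import time


def profile_loader(batches, step_fn, max_iters=50):
    """Return (seconds spent waiting for data, seconds spent in step_fn, iterations)."""
    wait_s = 0.0
    step_s = 0.0
    n = 0
    it = iter(batches)
    while n < max_iters:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the loader fetches the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On a GPU, kernels launch asynchronously, so step_fn should
        # synchronize before returning or the step time will look too small.
        step_fn(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        n += 1
    return wait_s, step_s, n


def slow_batches(count, delay):
    """Hypothetical stand-in for a loader that is limited by slow storage."""
    for i in range(count):
        time.sleep(delay)
        yield i


wait_s, step_s, n = profile_loader(slow_batches(5, 0.02), lambda batch: None)
print(f"{n} iterations: {wait_s:.3f}s waiting for data, {step_s:.3f}s in the step")
```

If the waiting time dominates, the GPU is idle because of data loading, not because of the model.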

karlTUM · August 15, 2019, 9:49am

Hi, the GPU usage is always zero. I think my dataset is too large, so I had to put it on a mounted storage machine. Maybe that is the reason for the IO problem?

ptrblck · August 16, 2019, 12:13am

Might be unrelated, but how many workers are you using? Also do you set pin_memory=True?

hasakii · August 16, 2019, 12:33am

It's mainly caused by an IO problem. It seems that you are using a supercomputer, but the swap area is too small. Loading data from disk then consumes too much memory for data buffering.

Here are some suggestions:

Map each input feature to its output label one by one and write them contiguously to disk.
Clean the swap area.
Use another machine if possible.
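
One possible reading of the first suggestion is to pack every (features, label) pair into a single file of fixed-size records, so that each sample is one contiguous read instead of many small files. The following is a minimal sketch under that assumption; `N_FEATURES`, `write_records`, `read_record`, and the file name are hypothetical, and a real dataset would need its own record layout.

```python
import os
import struct
import tempfile

N_FEATURES = 4  # hypothetical feature length
# Little-endian record: N_FEATURES float32 values followed by one int64 label.
RECORD = struct.Struct(f"<{N_FEATURES}fq")


def write_records(path, samples):
    """Write (features, label) pairs back to back into one file."""
    with open(path, "wb") as f:
        for features, label in samples:
            f.write(RECORD.pack(*features, label))


def read_record(path, index):
    """Seek straight to record `index` and decode it."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        values = RECORD.unpack(f.read(RECORD.size))
    return list(values[:-1]), values[-1]


samples = [([0.0, 0.5, 1.0, 1.5], 0), ([2.0, 2.5, 3.0, 3.5], 1)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.bin")
    write_records(path, samples)
    features, label = read_record(path, 1)

print(features, label)
```

Because every record has the same size, sample `i` starts at byte offset `i * RECORD.size`, which keeps random access cheap.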

karlTUM · August 17, 2019, 12:18pm

Right now I use 8 workers. Yes, I set pin_memory=True. After I moved the data back onto the local disk, the speed became normal.
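
For anyone comparing a local disk against mounted storage, a rough sequential read test can show the difference before training starts. This is only a sketch: `read_throughput` is a hypothetical helper, and the temporary directory with random files stands in for a real dataset path. Note that the operating system caches recently read files, so a second run over the same data will usually look faster than the first.

```python
import os
import tempfile
import time


def read_throughput(root, max_bytes=256 * 1024**2, chunk=1024**2):
    """Read files under `root` sequentially; return (bytes read, MB/s)."""
    total = 0
    start = time.perf_counter()
    for dirpath, _dirs, files in os.walk(root):
        if total >= max_bytes:
            break
        for name in files:
            if total >= max_bytes:
                break
            with open(os.path.join(dirpath, name), "rb") as f:
                while total < max_bytes:
                    data = f.read(chunk)
                    if not data:
                        break
                    total += len(data)
    elapsed = time.perf_counter() - start
    return total, total / 1024**2 / max(elapsed, 1e-9)


# Demo on a throwaway directory; point `read_throughput` at the dataset
# location on each storage system to compare them.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"sample_{i}.bin"), "wb") as f:
            f.write(os.urandom(512 * 1024))
    n_bytes, mb_s = read_throughput(tmp)

print(f"read {n_bytes} bytes at {mb_s:.1f} MB/s")
```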