I ran into a strange problem while training my model on a server. For the first several iterations, the running time is fine: all the CPUs are busy with IO and the GPU is working. But after several iterations, the CPUs suddenly stop doing IO. I do not know what happened.
Could someone help me? Thank you very much!
During the first several iterations:
After the CPUs suddenly stop working:
It stays like this from then on, and training is very slow.
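
To check whether the slowdown really comes from data loading rather than from the training step itself, the time per iteration could be split into "waiting for the next batch" and "running the step". Below is a minimal sketch of that idea in plain Python. The names `timed_iterations`, `batches`, and `step_fn` are hypothetical placeholders, not part of any framework; `batches` stands for whatever iterable yields the training data and `step_fn` for whatever function runs one training step.

```python
import time


def timed_iterations(batches, step_fn):
    """Yield (index, data_wait_seconds, step_seconds) for each iteration.

    `batches` and `step_fn` are hypothetical stand-ins for the real
    data loader and the real training step.
    """
    it = iter(batches)
    index = 0
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent here is data loading / IO wait
        except StopIteration:
            return
        t1 = time.perf_counter()
        step_fn(batch)            # time spent here is the training step
        t2 = time.perf_counter()
        yield index, t1 - t0, t2 - t1
        index += 1


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs on its own.
    def toy_batches():
        for i in range(3):
            time.sleep(0.01)      # pretend to read data
            yield i

    for i, wait, step in timed_iterations(toy_batches(), lambda b: None):
        print(f"iter {i}: data wait {wait:.4f}s, step {step:.4f}s")
```

If the data wait jumps at the same iteration where the CPUs go idle, the bottleneck is probably on the loading side. One caveat: if the framework launches GPU work asynchronously, the step time may be understated unless `step_fn` waits for the GPU to finish before returning.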