I came across a strange problem when training my model on the server. For the first several iterations, the running time is OK: all the CPUs are busy with IO and the GPU is working. But after several iterations, the CPUs suddenly stop doing IO. I do not know what happened.
It is mainly caused by an IO problem. It seems that you are using a supercomputer whose swap area is too small. Loading data from disk then consumes too much memory for data buffering.
Here are some suggestions:
Map each input feature to its output label one by one, and write the pairs contiguously to disk.
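
One way to do this, as a rough sketch rather than a drop-in solution: pack every feature vector together with its label into a fixed-size binary record and append the records to a single file. Any sample can then be read with one seek and one small read, without buffering the whole dataset. The names FEATURE_DIM, write_records and read_record are made up for illustration, and the layout (float32 features, one int32 label) is an assumption about your data.

```python
import struct

# Assumption: every sample has FEATURE_DIM float32 features and one int32 label.
FEATURE_DIM = 4
# "<" means little-endian with no padding, so each record is exactly
# FEATURE_DIM * 4 + 4 bytes long.
RECORD = struct.Struct(f"<{FEATURE_DIM}fi")


def write_records(path, features, labels):
    """Write (feature, label) pairs back to back into one file."""
    with open(path, "wb") as f:
        for x, y in zip(features, labels):
            f.write(RECORD.pack(*x, y))


def read_record(path, index):
    """Read a single pair by seeking straight to its byte offset."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        values = RECORD.unpack(f.read(RECORD.size))
    return list(values[:FEATURE_DIM]), values[FEATURE_DIM]
```

Because all records have the same size, the offset of sample i is simply i * RECORD.size, so a data loader can fetch samples in any order while keeping only one record in memory at a time.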