Why is my training bottlenecked on CPU?

[Edit: Sorry if my question is just a variation on "why are hand-implemented RNNs so slow?", which has been asked before. What I don't understand is this: if the answer to that question is that kernel launches take a long time, why is my CPU pegged at 100%, and why does a faster CPU core give better performance?]

I have a model (a hand-implemented LSTM plus a three-layer convnet) that achieves only around 50% GPU utilization (very roughly) during training. One core is always pegged at 100%, and the model trains slower on a machine with a slower CPU and faster GPU than on a machine with a faster CPU and slower GPU. If I change the LSTM from 1024 units to 2048 units, training only slows down by about 20%.

I have tried profiling the model using cProfile (and I also profiled an earlier version with the line profiler), but nothing jumps out at me as out of the ordinary. It seems to spend roughly half of its time in the forward pass and half in the backward pass. I am not doing any significant preprocessing or transferring large quantities of data between the CPU and GPU (at least not intentionally). The Python process uses 22GB of virtual memory, but only about 2GB is resident, and the system seems to have plenty of free memory. The hard disk is mostly idle. My best guess is that I am doing some things that take a lot of computation and have to execute on the CPU, but it is not clear to me what those things would be. Or possibly kernel launch overhead or CPU-GPU latency is adding up, but in that case I am not sure why faster single-core performance would help.
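One caveat worth noting when profiling: CUDA kernel launches are asynchronous, so a wall-clock profiler like cProfile attributes GPU time to whichever Python call happens to block. A sketch of how you can get honest forward/backward timings by synchronizing explicitly (this uses a small stand-in `nn.LSTM`, not my actual model, and smaller sizes than the 1024-unit network described above):

```python
import time
import torch
import torch.nn as nn

# Stand-in model with assumed (hypothetical) sizes, not the real one.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.LSTM(input_size=64, hidden_size=256).to(device)
x = torch.randn(50, 16, 64, device=device)  # (seq_len, batch, features)

def timed(fn):
    # Synchronize before and after so the measured interval covers the
    # actual GPU work, not just the (asynchronous) kernel launches.
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return out, time.perf_counter() - t0

(out, _), fwd = timed(lambda: model(x))
loss = out.sum()
_, bwd = timed(lambda: loss.backward())
print(f"forward: {fwd:.4f}s  backward: {bwd:.4f}s")
```

If the synchronized timings disagree badly with what cProfile reports, the difference is time spent waiting on (or racing ahead of) the GPU.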


If your RNN is written in terms of RNNCell/LSTMCell/GRUCell plus for loops and it is a fairly small RNN, then you may be suffering from autograd overhead. We're working on this. See: https://github.com/pytorch/pytorch/issues/2518#issuecomment-327835296