CPU usage far too high and training inefficient

Hi all,

I am training my model on the CPU. A very strange behaviour occurred (which I was able to work around), but I thought I would bring it up because I cannot imagine that this is desired behaviour:

When I train my model on the CPU of my PC with 24 cores, all 24 cores are used at 100%, even though my model is rather small (that's why I don't train it on the GPU). Most of the workload is also kernel usage. Training takes about 2.5 seconds per epoch. I have PyTorch version 1.0.1.post2 on that PC.

To make it train faster, I pushed it to a server with 80 cores. There, however, I saw exactly the same behaviour: during training, all 80 cores were used at 100% load, and the time per epoch was again about 2.5 seconds on average. On that server I use PyTorch version 1.1.0.

Reading through some forum threads, I tried torch.set_num_threads(1). This not only cut the CPU usage down to one core (as expected), but training is also much faster: about 1 second per epoch now.
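For reference, this is roughly what the script looks like now (a minimal sketch; the tiny model and random data are just placeholders so the snippet runs on its own, and the only relevant part is the torch.set_num_threads(1) call before training starts):

```python
import torch
import torch.nn as nn

# Limit intra-op parallelism to a single thread before any training work starts.
torch.set_num_threads(1)

# Placeholder small model and data, just to illustrate where the call goes.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 32)
y = torch.randint(0, 2, (256,))

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```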

So I am not sure whether this behaviour is really intended: spreading the workload over all CPU cores not only consumes all available resources, it is also much slower.

Hi,

Unfortunately, this is a known limitation. We use OpenMP to parallelize CPU work, and by default it uses all available cores. So on machines with many cores, it is sometimes necessary to manually reduce the number of cores it is allowed to use :confused:
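For example (a minimal sketch; the value 1 is only an illustration, a small number like 2 or 4 may work better depending on the workload):

```python
import os

# Option 1: set the environment variable before torch is imported,
# so OpenMP picks it up when the library initialises.
os.environ["OMP_NUM_THREADS"] = "1"  # example value, tune for your workload

import torch

# Option 2: limit PyTorch's intra-op thread pool directly.
torch.set_num_threads(1)

print(torch.get_num_threads())  # should now report 1
```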

Okay, I see. But do you have any idea why the code runs so much slower on 80 cores than on a single core?

It is most likely because PyTorch tries to parallelize many “small” operations across many cores. Each core then has almost nothing to do, but the overhead of communication between the cores is very large.
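One way to see this is to time a small operation with different thread counts (a rough sketch; the matrix size and iteration count are arbitrary, and whether extra threads help or hurt depends on the operation size and the backend):

```python
import time
import torch

default_threads = torch.get_num_threads()  # uses all cores by default
x = torch.randn(64, 64)
w = torch.randn(64, 64)

for n_threads in (1, 4, default_threads):
    torch.set_num_threads(n_threads)
    start = time.perf_counter()
    for _ in range(10000):
        y = x @ w  # a "small" op: per-core work is tiny, so coordination overhead can dominate
    elapsed = time.perf_counter() - start
    print(f"{n_threads:3d} threads: {elapsed:.3f} s")
```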

By the way, this is in the process of being solved, you can track the progress here: https://github.com/pytorch/pytorch/issues/24080.

great, thanks a lot!

It seems there is no more progress on this?

I ran into the same problem today on a machine with 64 cores. A simple training run pushed all of them to 100% and slowed down training immensely.

The fix was to call torch.set_num_threads(1) at the beginning of the script.