PyTorch on CPU seems to use only Python. Why?

I’m currently using PyTorch on an AMD Epyc Gen2 CPU. I notice that sometimes (with a batch size of, for example, 200) it uses several cores, but never more than half of them.

What worries me is the following: when I run the top command (I work on Linux), I see that the processor is busy with the Python process.

I expected that PyTorch would compile the neural network for the CPU architecture and that I would see a compiled C or C++ program in top’s list of processes.

What should I change so that PyTorch compiles the neural network into native processor instructions?

I’m currently using torch.compile with the default backend, but it doesn’t seem to have any effect.
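For reference, this is roughly how I’m calling it (the model here is just a stand-in, not my real network):

```python
import torch

# placeholder model for illustration; my actual network is different
net = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# torch.compile with the default backend
compiled_net = torch.compile(net)

x = torch.randn(200, 128)   # batch of 200, as mentioned above
out = compiled_net(x)
```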

Hi Veter!

This is to be expected. When you run python (before you have python
execute a script that happens to run some pytorch stuff), the python
interpreter runs as a process. When you import pytorch (and maybe
when you actually execute some pytorch stuff), the python interpreter
executes some pytorch python code that dynamically loads some
(compiled) c++ libraries (and potentially some cuda stuff) into that
original python process. So that single python process shows up in
top, and if those c++ pytorch libraries are dominating the use of the
cpu, that cpu usage is still charged to the python process that you
see in top (because, among other reasons, there is no other process).
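If you want to see this for yourself, here is a small sketch (Linux-specific, and the exact library names and paths depend on your install) that lists the pytorch shared libraries mapped into the running python process:

```python
import torch  # importing pytorch dynamically loads its compiled c++ libraries

# On Linux, /proc/self/maps lists the shared objects mapped into this
# very python process.
with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f if "torch" in line and ".so" in line}

for lib in sorted(libs):
    print(lib)   # e.g. .../torch/lib/libtorch_cpu.so, libc10.so, ...
```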

No, there is nothing you need to change: the pytorch c++ (and potentially cuda)
code is already compiled into dynamically-loaded libraries.
The basic idea is that for “real” problems,
the compute time is dominated by applying various c++ / cuda layers to
batches of inputs that flow through the network, and the overhead of the
(slow) python code that organizes applying those layers is typically not
particularly large.
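Here is a rough sketch (the sizes are arbitrary) of where the time goes: one big matmul spends essentially all of its time inside the compiled c++ kernel, while a python loop over tiny ops spends a noticeable fraction of its time in python / dispatch overhead:

```python
import time
import torch

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

# One big matmul -- essentially all of this time is spent inside the
# compiled c++ kernel, not in the python interpreter.
t0 = time.perf_counter()
c = a @ b
print("one big matmul:", time.perf_counter() - t0)

# Many tiny ops driven from python -- here the per-call python / dispatch
# overhead is a noticeable fraction of the total time.
xs = torch.randn(10000, 4)
t0 = time.perf_counter()
s = torch.zeros(4)
for row in xs:
    s = s + row
print("10000 tiny adds:", time.perf_counter() - t0)
```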

Again, pytorch does not spawn additional processes to run the model.
The “fast” cpu / cuda code is packaged as libraries and dynamically
loaded into that single python process.

This is fully analogous to using a “python” package such as numpy.
Numpy also delegates its heavy lifting to compiled c++ code that is
dynamically loaded into the single python process.

Typically, pytorch uses python to string together large compute-intensive
chunks, and the python overhead doesn’t matter. In such cases,
torch.compile doesn’t have much to work with to speed things up.
However, if you write some python loops that call lots of little tensor
operations, then torch.compile (depending on the details) may be able
to speed things up a lot.
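As a sketch (the details depend on your pytorch version and backend), this is the kind of function torch.compile can help with, because the many small pointwise ops can be fused into fewer compiled kernels:

```python
import torch

def lots_of_little_ops(x):
    # many small pointwise operations -- lots of python / dispatch overhead
    for _ in range(50):
        x = torch.sin(x) * 0.5 + torch.cos(x) * 0.5
    return x

compiled = torch.compile(lots_of_little_ops)   # default backend

x = torch.randn(1000)
y_eager = lots_of_little_ops(x)
y_comp = compiled(x)   # first call triggers compilation, later calls are fast
print(torch.allclose(y_eager, y_comp, atol=1e-6))
```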

As an aside, if your model uses big, chunky cuda operations, your cpu may
be largely free while it waits for the gpu, in which case top will not show
your python process as consuming much cpu (because it isn’t). However,
something like nvidia-smi will show your python process consuming a lot
of gpu time.

Best.

K. Frank

Thank you, K. Frank!