How to Speed Up a PyTorch Model in CPU Eager Mode?

Hi, I’m currently running a PyTorch model in eager mode on the CPU, and I’m looking for ways to improve inference performance.

The model uses typical linear layers and tensor operations, but performance is slower than I expected. I'm not using TorchScript or torch.compile in this context; this is purely eager execution.

I would like to know:

  • How can I ensure PyTorch is utilizing all CPU threads efficiently?
  • Are there any recommended environment variables or settings (e.g., MKL, OpenMP) to tweak?
  • What are the best practices to maximize CPU performance in eager mode?

Here’s what I’ve tried so far:

```python
import torch

# Inter-op thread count must be set before any parallel work starts,
# otherwise this call raises a RuntimeError.
torch.set_num_interop_threads(4)
# Intra-op parallelism: threads used within a single op (e.g., a matmul).
torch.set_num_threads(4)
```
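For context, here is roughly what my inference path looks like. The model and shapes are placeholders standing in for my real ones; the relevant part is that I run everything under `torch.inference_mode()`, which I understand is cheaper than `no_grad` in eager mode because it skips autograd bookkeeping entirely:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for my real one: just linear layers + ReLU.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()  # disable dropout/batchnorm training behavior

x = torch.randn(32, 128)  # illustrative batch

# inference_mode disables autograd tracking completely,
# avoiding per-op overhead that no_grad still pays.
with torch.inference_mode():
    out = model(x)

print(out.shape)  # torch.Size([32, 10])
```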

I also set these environment variables:

```bash
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
```
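I also experimented with standard OpenMP thread-affinity settings, though I'm not sure these values are right for my machine (they're guesses, not something I've validated):

```bash
# Pin OpenMP threads to physical cores (values are guesses for my setup).
export OMP_PROC_BIND=close
export OMP_PLACES=cores
```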

But the speedup was marginal. Are there any other profiling tools or CPU-specific optimizations you would recommend?
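In case it helps, this is the snippet I've been using to look for hotspots, based on `torch.profiler` (the model here is again a stand-in for my real one):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in model; my real model is larger but similarly linear-heavy.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
x = torch.randn(32, 128)

with torch.inference_mode():
    # Warm up so one-time allocation costs don't dominate the trace.
    for _ in range(3):
        model(x)

    # Record one forward pass on CPU.
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        model(x)

# Top ops by total CPU time: shows which kernels dominate.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

This at least tells me which ops take the most time, but I'm not sure how to act on the results beyond the thread settings above.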

Thanks!