Dual CPU socket RAM utilization (Why is only one socket being used?!)

Hello, I was wondering if anyone has experience looking into RAM memory bandwidth when running PyTorch on CPU, specifically whether the memory bandwidth from both sockets is being used.

I have a setup with 8 DRAM DIMMs, 4 per CPU socket (two NUMA nodes).
When I used Intel PCM memory profiling to watch bandwidth utilization while running my PyTorch CPU code for large matmuls, I found that only one socket's memory was being used.
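For context, a minimal version of the kind of workload I'm profiling looks roughly like this (sizes and iteration count are illustrative, not my exact run):

```python
import time
import torch

# Illustrative benchmark: repeated large float32 matmuls on CPU,
# run long enough for Intel PCM to sample memory bandwidth.
# Scale n up (e.g., 8192+) to make the bandwidth pattern clearer.
n, iters = 4096, 3
a = torch.randn(n, n)
b = torch.randn(n, n)

start = time.time()
for _ in range(iters):
    c = a @ b
elapsed = time.time() - start

# Rough throughput estimate: 2*n^3 FLOPs per matmul.
print(f"{iters * 2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")
```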

[Screenshot: Intel PCM memory bandwidth output]

I saw some posts saying this is deliberate, to avoid remote NUMA accesses.
But I really want a workaround, because I think CPU performance could actually be a lot better :frowning:

Any comments? I'd be grateful for any workarounds :slight_smile:
Feel free to suggest other frameworks (e.g., llama.cpp) or optimizations (torchcpu, torch.compile) that may make this happen!
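One workaround I've been wondering about (untested on my end) is forcing an interleaved allocation policy with numactl, so that pages land on both NUMA nodes. Would something like this help, or does the threading layer still keep the traffic on one socket? Here `matmul_bench.py` is a hypothetical stand-in for my actual script:

```shell
# Hypothetical invocation: interleave memory pages across all NUMA nodes
# and let OpenMP spawn threads on every core.
OMP_NUM_THREADS=$(nproc) numactl --interleave=all python matmul_bench.py
```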

Taehyun