Hello, I was wondering if anyone has experience looking into DRAM memory bandwidth when running PyTorch on CPU, specifically whether the memory bandwidth of both sockets is being used.
My setup has 8 DRAM modules (DIMMs), 4 attached to each CPU socket (two NUMA nodes).
When I use Intel PCM memory profiling to watch bandwidth utilization while running my PyTorch CPU code for large matmuls, I see that only one socket's memory bandwidth is being used.
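For reference, this is a minimal sketch of the kind of workload I'm profiling while PCM is running; the matrix size and iteration count are placeholders, not my exact benchmark:

```python
import torch

# Large square matmul repeated many times, roughly what I watch in PCM
# (n and the loop count are just example values)
n = 8192
a = torch.randn(n, n)
b = torch.randn(n, n)

for _ in range(50):
    c = torch.matmul(a, b)

print(c.sum())  # touch the result so the work isn't skipped
```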
I've seen some posts saying this is done deliberately to avoid remote NUMA accesses.
But I really want a workaround, because I think CPU performance could actually be a lot better if both sockets' memory bandwidth were in use.
Any comments? I'd be grateful for any workarounds.
Suggestions for other frameworks (e.g., llama.cpp) or optimizations (torchcpu, torch.compile) that might make this happen are also welcome!
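For context, one workaround I was thinking of trying is launching the script under numactl so page allocations are interleaved across both nodes (the script name below is just a placeholder, and I'm not sure whether PyTorch's OpenMP thread pool would actually spread its threads across both sockets this way):

```bash
# Interleave memory allocations across both NUMA nodes
numactl --interleave=all python matmul_bench.py

# Single-socket baseline for comparison: node 0 CPUs and node 0 memory only
numactl --cpunodebind=0 --membind=0 python matmul_bench.py
```

If anyone has tried this and knows whether it actually helps, I'd love to hear.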
Taehyun