As said, I am considering upgrading my setup, but I wonder whether DataLoader’s multiprocessing can fully utilize all the P-cores and E-cores on Intel’s new 13th-gen CPUs.
Based on my reading of the DataLoader code, the speed difference between P-cores and E-cores shouldn’t cause performance issues. Still, I am asking here to make sure.
I don’t think DataLoader has been tested on that hardware yet.
Based on my limited understanding of the new CPUs, I wouldn’t expect significant differences, but it is hard to tell. It depends on the task at hand (whether the data loading processes are compute-bound or IO-bound) and on how well work is scheduled and balanced between the P-cores and E-cores.
Please let us know if you have compared different generations of Intel CPUs and found noticeable performance differences.
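To get a feel for how far apart the P-cores and E-cores are on a given machine, one option is to pin a fixed CPU-bound task to each core in turn and compare the timings. The sketch below is only an illustration and makes a few assumptions: it runs on Linux (it relies on `os.sched_setaffinity`), and `busy_work`, `time_on_core`, and `per_core_timings` are hypothetical helper names, not part of PyTorch or DataLoader.

```python
import os
import time


def busy_work(n: int = 100_000) -> int:
    """Fixed, purely CPU-bound workload (integer arithmetic in a loop)."""
    total = 0
    for i in range(n):
        total += i * i % 7
    return total


def time_on_core(core: int, repeats: int = 3) -> float:
    """Pin this process to `core` and return the best wall time for busy_work."""
    os.sched_setaffinity(0, {core})  # Linux-only call
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        busy_work()
        best = min(best, time.perf_counter() - start)
    return best


def per_core_timings() -> dict[int, float]:
    """Time busy_work on every core this process is allowed to use."""
    allowed = os.sched_getaffinity(0)
    try:
        return {core: time_on_core(core) for core in sorted(allowed)}
    finally:
        os.sched_setaffinity(0, allowed)  # restore the original affinity mask


if __name__ == "__main__":
    for core, seconds in per_core_timings().items():
        print(f"core {core:3d}: {seconds * 1e3:7.2f} ms")
```

On a hybrid CPU, the cores should fall into two visibly different groups if the gap matters for this kind of work. A pure-Python loop is not representative of real data loading, so this only gives a rough indication.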
Hi @nivek, I happen to have a personal 13th-gen i7 at home and a 12th-gen i7 at work, and I observe a significant (roughly 2x) slowdown of the same script on the 13th gen compared to the 12th gen.
I tried hard to explain this by code version differences this morning, but both machines are up-to-date Arch installs with the same torch version (2.2.1), so I ended up suspecting the hardware. I’m not exactly sure how I can help you dig further into the problem, but I’m happy to help ^ ^"
import torch

torch.manual_seed(12)
torch.set_num_threads(1)


def batched_dot_mul_sum(a, b):
    """Computes batched dot by multiplying and summing"""
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    """Computes batched dot by reducing to ``bmm``"""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)


# Input for benchmarking.
x = torch.randn(1_000_000, 640)

# Ensure that both functions compute the same output.
assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))
On the 12th-gen i7 machine:
$ time python benchmark.py
real 4,82s
user 4,18s
sys 0,63s
On the 13th-gen i7 machine:
$ time python benchmark.py
real 10,57s
user 9,39s
sys 1,16s
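Note that `time python benchmark.py` measures everything at once: interpreter startup, `import torch`, the creation of `x`, and both function calls. To narrow down where the extra time goes, each step could be timed separately inside the script. Below is a minimal sketch using only the standard library; `best_time` is a hypothetical helper name, and the workload in the `__main__` block is a stand-in so the snippet runs without torch. `torch.utils.benchmark.Timer` would be another option for the torch calls.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest of `repeats` wall-clock timings of fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without torch installed.
    print(f"{best_time(lambda: sum(range(100_000))):.6f} s")
    # In benchmark.py one would instead time each step, for example:
    #   best_time(lambda: batched_dot_mul_sum(x, x))
    #   best_time(lambda: batched_dot_bmm(x, x))
```

Since the script sets `torch.set_num_threads(1)`, it may also be worth checking which core the process runs on. On Linux, `taskset -c 0 python benchmark.py` pins the run to a single core; which core indices map to P-cores depends on the machine.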