Does DataLoader work normally on Intel's 13th gen CPU?

TimandXiyu · December 19, 2022, 9:48am

As the title says, I am considering upgrading my setup, but I wonder whether DataLoader’s multiprocessing can fully utilize all the P-cores and E-cores on Intel’s new 13th gen CPUs.

Based on my reading of the DataLoader code, it seems like the speed difference between P-cores and E-cores won’t cause performance issues. Still, I am asking here to make sure.

nivek · January 3, 2023, 8:50pm

I don’t think DataLoader has been tested on that hardware yet.

Based on my limited understanding of the new CPUs, I would think there shouldn’t be significant differences, but it is hard to tell. It depends on the task at hand (whether the data loading processes are compute-bound or IO-bound) and on how well the CPU schedules/balances work between the P-cores and E-cores.

Please let us know if you have compared different generations of Intel CPUs and found noticeable performance differences.
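
One rough way to see how much the cores differ on a given machine is to pin a fixed CPU-bound workload to each logical core in turn and time it. The sketch below is only an illustration under some assumptions: it needs Linux (`os.sched_setaffinity` is not available elsewhere), the helper names `busy_work` and `time_per_core` are made up for this example, and a pure-Python loop is not representative of a real data loading pipeline.

```python
import os
import time


def busy_work(n: int = 100_000) -> int:
    """A small, fixed CPU-bound loop used as the timing workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_per_core(repeats: int = 3) -> dict:
    """Return the best wall-clock time of busy_work on each allowed logical core."""
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("os.sched_setaffinity is only available on Linux")
    original = os.sched_getaffinity(0)  # cores this process may currently use
    timings = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to one core
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                busy_work()
                best = min(best, time.perf_counter() - start)
            timings[core] = best
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return timings


if __name__ == "__main__":
    for core, seconds in time_per_core().items():
        print(f"core {core}: {seconds * 1e3:.2f} ms")
```

If the per-core times fall into two clearly separated groups, that would suggest the workload runs noticeably slower on one kind of core. Which logical core IDs map to P-cores or E-cores is not something this script can tell you.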

iago-lito · March 28, 2024, 11:35am

Hi @nivek, I happen to have a personal i7 13th gen CPU at home and an i7 12th gen CPU at work, and I do observe a significant (2x) slowdown of the same process on the 13th gen compared to the 12th gen.

I tried hard to explain this from code version differences this morning, but both machines are up-to-date Arch installs with the same torch version 2.2.1, so I ended up suspecting hardware. I’m not exactly sure how I can help you investigate the problem further, but I’m happy to help ^ ^"

I used this example to benchmark:

import torch

torch.manual_seed(12)
torch.set_num_threads(1)

def batched_dot_mul_sum(a, b):
    """Computes batched dot by multiplying and summing"""
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    """Computes batched dot by reducing to ``bmm``"""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)


# Input for benchmarking.
x = torch.randn(1_000_000, 640)

# Ensure that both functions compute the same output.
assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))

On the 12th gen i7 machine:

$ time python benchmark.py
real 4,82s
user 4,18s
sys 0,63s

On the 13th gen i7 machine:

$ time python benchmark.py
real 10,57s
user 9,39s
sys 1,16s

What could be wrong?
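
Note that `time python benchmark.py` measures the whole process: interpreter startup, `import torch`, generating a 1,000,000 x 640 float32 tensor (about 2.56 GB), and both dot-product variants. To narrow down where the 2x gap comes from, each phase can be timed separately. The following is a minimal sketch of that idea, not a rigorous benchmark: `timed` and `run_phases` are hypothetical helper names, and each phase is run only once with no warm-up.

```python
import time

_t0 = time.perf_counter()
import torch  # timed on purpose: the import is part of what `time python` measures
import_seconds = time.perf_counter() - _t0


def batched_dot_mul_sum(a, b):
    """Computes batched dot by multiplying and summing"""
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    """Computes batched dot by reducing to ``bmm``"""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)


def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_phases(rows=1_000_000, cols=640):
    """Time each phase of the script above separately; returns seconds per phase."""
    torch.manual_seed(12)
    torch.set_num_threads(1)
    x, t_randn = timed(torch.randn, rows, cols)
    out_a, t_mul_sum = timed(batched_dot_mul_sum, x, x)
    out_b, t_bmm = timed(batched_dot_bmm, x, x)
    _, t_allclose = timed(out_a.allclose, out_b)
    return {
        "import torch": import_seconds,
        "randn": t_randn,
        "mul_sum": t_mul_sum,
        "bmm": t_bmm,
        "allclose": t_allclose,
    }
```

Calling `run_phases()` on both machines and printing the returned dictionary should show whether the difference is concentrated in one phase (for example tensor generation, which is mostly memory traffic) or spread evenly across all of them. For more careful per-function measurements, `torch.utils.benchmark.Timer` handles warm-up and repeated runs.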