Dataloader is slow with mps

For some reason, when using mps the dataloader is much slower (to the point where it's better to use the CPU).

Any ideas why?

code for reproduction:

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset as TorchDataset

class Dataset(TorchDataset):
    def __init__(self, device):
        self.a = torch.tensor(1, device=device)
    def __len__(self):
        return 100
    def __getitem__(self, i):
        return self.a, self.a

for device in ['mps', 'cpu']:
    dataloader = DataLoader(Dataset(device), batch_size=64)
    %time next(iter(dataloader))

Thanks in advance!

I see 2 problems

  1. Overhead: your tensors are tiny and you don't load all that many samples, so sending data to your GPU costs more than just doing the computation directly on the CPU.
  2. Your benchmark is also problematic because you're not doing any actual computation on the GPU. Just sending data to the GPU won't give you any benefit: GPUs are fast at matrix multiplication but comparatively slow at data transfers.
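To sidestep both problems, the usual pattern is to keep the dataset's tensors on the CPU and move each batch to the accelerator only inside the training loop. A minimal sketch (the `CPUDataset` name and the availability guard are my own, not from the thread):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CPUDataset(Dataset):
    """Toy dataset that keeps its tensors on the CPU, where the
    DataLoader (and any worker processes) expect them to live."""
    def __init__(self):
        self.a = torch.tensor(1)  # stays on CPU
    def __len__(self):
        return 100
    def __getitem__(self, i):
        return self.a, self.a

# Fall back to CPU when MPS isn't available.
device = "mps" if torch.backends.mps.is_available() else "cpu"

loader = DataLoader(CPUDataset(), batch_size=64)
for x, y in loader:
    # Move each batch to the accelerator only when it's needed.
    x, y = x.to(device), y.to(device)
    # ... a real model forward/backward pass would go here ...
```

This way the cheap indexing and collation stay on the CPU, and only one device transfer happens per batch rather than per item.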

Thanks for the reply @marksaroufim !
I understand this is only a toy example which doesn’t take into account the benefits of the GPU.
However, when using it in a real training/eval process, this leads to the GPU and CPU taking approximately the same total time (the GPU's speedup is canceled out by the slow data loading).

Therefore I’m not able to benefit from the GPU.
I believe it’s a bit too slow. What do you think?
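One way to check whether the slowness is transfer overhead rather than the dataloader itself is to time the same operation at two sizes on the device. A rough sketch, assuming a recent PyTorch where `torch.backends.mps.is_available()` and `torch.mps.synchronize()` exist (the `time_op` helper is hypothetical):

```python
import time
import torch

# Pick MPS when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

def time_op(t, n=10):
    """Roughly time n matrix multiplications of t on `device`."""
    t = t.to(device)
    torch.mm(t, t)  # warm-up, triggers lazy initialization
    if device == "mps":
        torch.mps.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        torch.mm(t, t)
    if device == "mps":
        torch.mps.synchronize()  # wait for queued GPU work to finish
    return time.perf_counter() - start

tiny_time = time_op(torch.randn(4, 4))       # dominated by launch/transfer overhead
big_time = time_op(torch.randn(2048, 2048))  # dominated by actual compute
```

If the per-batch time barely changes between the tiny and large cases on MPS, fixed per-call overhead is the bottleneck, and the fix is larger batches and fewer device transfers rather than a faster dataloader.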