Device error with dataloaders

Torch version 2.1.0.dev20230702+cu121

import torch
from torch.utils import data

torch.set_default_device('cuda')

class NullDataset(data.Dataset):
    def __len__(self) -> int:
        return 100
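        # __getitem__ deliberately omitted; the error below occurs before it would be called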

dataloader = data.DataLoader(NullDataset(), batch_size=64, shuffle=True, generator=torch.Generator(device='cuda'))

for batch in dataloader:  # iterating triggers the sampler, which raises the error below
    print(batch)

This raises an error at the for batch in dataloader: line:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "blabla\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "blabla\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 676, in _next_data
    index = self._next_index()  # may raise StopIteration
            ^^^^^^^^^^^^^^^^^^
  File "blabla\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 623, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "blabla\venv\Lib\site-packages\torch\utils\data\sampler.py", line 289, in __iter__
    for idx in self.sampler:
  File "blabla\venv\Lib\site-packages\torch\utils\data\sampler.py", line 167, in __iter__
    yield from map(int, torch.randperm(n, generator=generator).numpy())
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "blabla\venv\Lib\site-packages\torch\utils\_device.py", line 76, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I deliberately omitted the definition of __getitem__ for NullDataset to show that it isn't the cause of the error. How should I fix this? The error seems to have been introduced by the commit "Do not materialize entire randperm in RandomSampler" (#103339), which calls .numpy() on a CUDA tensor. Previously the sampler used torch.randperm(n, generator=generator).tolist() instead of map(int, torch.randperm(n, generator=generator).numpy()).
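The failing call can be reproduced in isolation. Here is a minimal sketch (the size 100 is arbitrary) showing why the old .tolist() path worked while .numpy() does not:

import torch

torch.set_default_device('cuda')
g = torch.Generator(device='cuda')

perm = torch.randperm(100, generator=g)  # lands on cuda:0 because of the default device
print(perm.tolist()[:5])                 # works: tolist() implicitly copies to host memory
print(perm.numpy())                      # TypeError: can't convert cuda:0 device type tensor to numpy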


Hi Joshua,

I got the same error with my setup last week. I'm using conda and had the latest CUDA driver installed for my RTX 3090, with some 2.1 version of PyTorch. I can't recall the exact versions of the CUDA driver and PyTorch, but I'm certain I didn't have the latest PyTorch release.
I tracked the problem down to the same cause you suggest. At first I didn't know how to fix it in code, so I simply updated PyTorch to the latest stable version, 2.1.2, and voilà, it ran again. I don't know why, and I can't reproduce the failure there.
Later I hit the problem again when running my project on a Slurm cluster with the latest stable image from PyTorch | NVIDIA NGC, so I went looking for a proper solution. In the end I used a custom sampler for my dataloader, which fixed the problem for me.
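For reference, here is a minimal sketch of the kind of sampler I mean (the name CPURandomSampler is just illustrative, not my exact code): it draws the shuffle permutation explicitly on the CPU, so the indices never live on a CUDA tensor even while torch.set_default_device('cuda') is active.

import torch
from torch.utils.data import Sampler

class CPURandomSampler(Sampler[int]):
    # Shuffles by drawing the permutation on the CPU, sidestepping the
    # CUDA randperm that RandomSampler hits under set_default_device('cuda').
    def __init__(self, data_source, generator=None):
        self.data_source = data_source
        self.generator = generator if generator is not None else torch.Generator(device='cpu')

    def __len__(self):
        return len(self.data_source)

    def __iter__(self):
        n = len(self.data_source)
        # device='cpu' overrides the global default device for this one call
        yield from torch.randperm(n, generator=self.generator, device='cpu').tolist()

You then construct the dataloader as DataLoader(dataset, batch_size=64, sampler=CPURandomSampler(dataset)); note that sampler= and shuffle=True are mutually exclusive, so shuffle has to be dropped.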