DataLoader multiprocessing with Dataset returning a CUDA tensor


I have a use case where my Dataset’s __getitem__ fetches a tensor from the main process’s memory, transfers it to the GPU to do some basic image processing much faster than the CPU could, and returns a CUDA tensor directly. A DataLoader then sits on top of that.
The DataLoader works fine with num_workers=0, but I get errors whenever I try to enable multiprocessing by increasing num_workers.

Here’s an MWE:

import torch

class CudaDataset(torch.utils.data.Dataset):
    def __init__(self, device):
        self.tensor_on_ram = torch.Tensor([1, 2, 3])
        self.device = device

    def __len__(self):
        return len(self.tensor_on_ram)

    def __getitem__(self, index):
        return self.tensor_on_ram[index].to(self.device)

ds = CudaDataset(torch.device('cuda:0'))
dl = torch.utils.data.DataLoader(ds, batch_size=1, pin_memory=False, num_workers=2)

# First pass runs with no issue at all
for i in dl:
    pass

## Let's do it a second time
for i in dl:  # Here it throws an error
    pass

Here’s the error:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-1-3f858a29a121>", line 12, in __getitem__
    return self.tensor_on_ram[index].to(self.device)
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/", line 207, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I’ve tried adding the line torch.multiprocessing.set_start_method('spawn') at the top, but then I get DataLoader worker (pid(s) 1078) exited unexpectedly

I’m not sure whether this use case is even possible; I just wanted to benchmark the performance gain I could get from more workers, as data loading is currently my bottleneck.

Does anyone know a way to work around this? Thanks

I don’t think that moving the data to the GPU right after loading it is a good idea.

An easy workaround would be to load the data as CPU tensors in __getitem__ and let the DataLoader fetch a batch of this data. Then you can move the batch to the GPU in the main process and apply the processing steps there.
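A minimal sketch of that workaround (the CpuDataset name and the toy tensor are illustrative, and the device falls back to the CPU when no GPU is present): __getitem__ returns plain CPU tensors, so the forked workers never initialize CUDA, and the batch is moved to the device once per iteration in the main process.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CpuDataset(Dataset):
    """Returns CPU tensors; the GPU transfer happens in the main process."""
    def __init__(self):
        self.tensor_on_ram = torch.Tensor([1, 2, 3])

    def __len__(self):
        return len(self.tensor_on_ram)

    def __getitem__(self, index):
        # No .to(device) here, so the workers never touch CUDA
        return self.tensor_on_ram[index]

if __name__ == "__main__":
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    dl = DataLoader(CpuDataset(), batch_size=1, num_workers=2)
    for batch in dl:
        batch = batch.to(device)  # transfer + image processing on the GPU here
```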

Another workaround would be to pass a custom collate_fn to the DataLoader.
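A sketch of the collate_fn approach, assuming default_collate from torch.utils.data.dataloader (the helper the DataLoader applies by default). One caveat worth noting: with num_workers > 0 the collate function also runs inside the workers, so on its own this does not avoid the fork/CUDA problem; it fits best with num_workers=0 or combined with the 'spawn' start method.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate

class CpuDataset(Dataset):
    """Illustrative dataset that returns CPU tensors."""
    def __init__(self):
        self.tensor_on_ram = torch.Tensor([1, 2, 3])

    def __len__(self):
        return len(self.tensor_on_ram)

    def __getitem__(self, index):
        return self.tensor_on_ram[index]

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def collate_to_device(batch):
    # Stack the CPU samples as usual, then move the whole batch at once
    return default_collate(batch).to(device)

# num_workers=0 here: with workers, collate_fn would run in the worker
# process and hit the same CUDA re-initialization error under fork
dl = DataLoader(CpuDataset(), batch_size=3, collate_fn=collate_to_device)
```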

It’s better to use pin_memory in the DataLoader, which puts batches into page-locked (pinned) memory. Then move the tensors to CUDA in your main process.
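A hedged sketch of the pin_memory variant (pinning is gated on CUDA availability here so the snippet also runs on CPU-only machines): pinned batches allow an asynchronous host-to-device copy via non_blocking=True.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CpuDataset(Dataset):
    """Illustrative dataset that returns CPU tensors."""
    def __init__(self):
        self.tensor_on_ram = torch.Tensor([1, 2, 3])

    def __len__(self):
        return len(self.tensor_on_ram)

    def __getitem__(self, index):
        return self.tensor_on_ram[index]

if __name__ == "__main__":
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    # pin_memory=True places each batch in page-locked RAM, enabling an
    # asynchronous copy to the GPU with non_blocking=True
    dl = DataLoader(CpuDataset(), batch_size=1, num_workers=2,
                    pin_memory=torch.cuda.is_available())
    for batch in dl:
        batch = batch.to(device, non_blocking=True)
```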

And when you use the spawn start method, please wrap your script’s entry point in an if __name__ == "__main__": guard.
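Putting the spawn advice together, a sketch under these assumptions: the DataLoader’s multiprocessing_context argument is used instead of a global set_start_method call, and the device falls back to the CPU when no GPU is present. With 'spawn', each worker starts a fresh interpreter, so it can initialize its own CUDA context, though each context costs startup time and GPU memory.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CudaDataset(Dataset):
    """Dataset from the question: moves each item to the device itself."""
    def __init__(self, device):
        self.tensor_on_ram = torch.Tensor([1, 2, 3])
        self.device = device

    def __len__(self):
        return len(self.tensor_on_ram)

    def __getitem__(self, index):
        return self.tensor_on_ram[index].to(self.device)

def main():
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    ds = CudaDataset(device)
    # 'spawn' workers are fresh interpreters, so each may initialize CUDA;
    # the __main__ guard below is required for spawn to re-import safely
    dl = DataLoader(ds, batch_size=1, num_workers=2,
                    multiprocessing_context='spawn')
    for batch in dl:
        pass

if __name__ == "__main__":
    main()
```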