Moving data to GPU in collate_fn fails

I have to generate a lot of randomized batches. One thing I can’t do is pre-store all the data on the GPU (that would take too much space), so right now I’m moving each batch from the CPU to the GPU in the training loop.

I’d like to speed things up by moving the data to the GPU already in the batch worker, so that I can use it directly in the training loop.

Here’s the MWE that illustrates my plan:

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np

class ExampleDataset(Dataset):
    def __init__(self):
        super().__init__()
    def __len__(self):
        return 100000
    def __getitem__(self, idx):
        return np.random.rand(3)

def custom_collate_fn(batch):
    # stack the per-sample numpy arrays into one (batch_size, 3) tensor
    batch = torch.as_tensor(np.stack(batch))
    # I'd like to pre-move the data to the GPU, but I get an error here:
    batch = batch.to('cuda', non_blocking=True)
    return batch

batch_loader = DataLoader(
    ExampleDataset(),
    batch_size=100,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    collate_fn=custom_collate_fn
)

# training loop
NUM_EPOCHS = 10
for epoch in range(NUM_EPOCHS):
    for batch_num, train_batch in enumerate(batch_loader, 0):
        # usually I transfer train_batch from the CPU to the GPU here,
        # which causes delays (I have huge batch sizes)
        print('training.')

Here’s the error that I get:

/local/home/venv/bin/python -u /local/home/BT/mwe.py
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
Traceback (most recent call last):
  File "/local/home/BT/mwe.py", line 39, in <module>
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=55 error=3 : initialization error
    for batch_num, train_batch in enumerate(batch_loader, 0):
  File "/local/home/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/local/home/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/local/home/venv/lib/python3.5/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/local/home/BT/mwe.py", line 22, in custom_collate_fn
    batch = batch.to('cuda', non_blocking=True)
  File "/local/home/venv/lib/python3.5/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:55


Process finished with exit code 1

Do you know how I could achieve this? Unfortunately I can’t get rid of this error (I’ve tried different torch.multiprocessing start methods, but that didn’t help).

Check this out. It is a copy of your error.

To summarize, setting cudnn.benchmark to False works for some people; you could try that too. (If your randomized batches vary in size, you should be disabling benchmark mode anyway.) Others have fixed the error by trying different combinations of CUDA and PyTorch versions.
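
For reference, that flag is just a module-level switch, typically set once near the top of the training script. A minimal sketch:

import torch

# Disable cuDNN's autotuner. With varying input/batch sizes it tends to be
# counterproductive anyway, and some people report that turning it off makes
# the initialization error go away.
torch.backends.cudnn.benchmark = False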

Something you’re trying that’s different is pushing to the GPU inside your collate_fn. (I’m sure this sparks a discussion about multiple worker processes copying data to the GPU at the same time; if you know of such a thread, do link it here.) As a test, you could return batch from custom_collate_fn without the .to('cuda') call, and instead add train_batch = train_batch.to('cuda', non_blocking=True) inside your for batch_num, train_batch ... loop.
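
In code, the test I mean would look roughly like this (a sketch that reuses the names and imports from your MWE):

def custom_collate_fn(batch):
    # only stack on the CPU here; leave the device transfer to the main process
    return torch.as_tensor(np.stack(batch))

# training loop
for epoch in range(NUM_EPOCHS):
    for batch_num, train_batch in enumerate(batch_loader, 0):
        # with pin_memory=True this copy can run asynchronously w.r.t. the host
        train_batch = train_batch.to('cuda', non_blocking=True)
        print('training.')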

If it still fails with the same error, try sending a reply on that thread. If it does work, still send a reply on that thread; it will greatly help debug the issue :slightly_smiling_face:

Setting cudnn.benchmark didn’t help (neither True nor False).

Setting CUDA_LAUNCH_BLOCKING=0 or 1 didn’t help either.

Doing the .to() call inside the for loop (instead of in the collate function) works, but that’s exactly what I’d like to avoid: I’d like the worker to have already stored the data on the GPU so that it can be accessed directly.
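
(A possible middle ground, just as a sketch: keep the transfer in the main process, but overlap it with the compute on a side CUDA stream. The CUDAPrefetcher wrapper below is made up for illustration; it assumes the DataLoader keeps pin_memory=True and that the collate_fn returns CPU tensors.)

import torch

class CUDAPrefetcher:
    """Wraps an iterable of CPU batches and copies the next batch to the GPU
    on a side stream while the current batch is being used for training."""
    def __init__(self, loader, device='cuda'):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # asynchronous only if the source tensor lives in pinned memory
            self.next_batch = batch.to(self.device, non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # make the default stream wait until the pending copy has finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        # keep the copied memory alive until the default stream is done with it
        batch.record_stream(torch.cuda.current_stream())
        self._preload()
        return batch

# usage inside the epoch loop:
# for batch_num, train_batch in enumerate(CUDAPrefetcher(batch_loader), 0):
#     ...  # train_batch is already on the GPU here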

Yes, I think the child processes spawned for the batch workers create a new CUDA context. What should happen instead is that the tensor gets created in the CUDA context of the parent process, but I don’t know how to do that.
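
One related thing that could be worth a try (it gives every worker its own fresh context rather than sharing the parent’s): start the workers with 'spawn' instead of 'fork'. A rough sketch, reusing the MWE definitions and assuming a PyTorch version new enough to expose DataLoader’s multiprocessing_context argument:

batch_loader = DataLoader(
    ExampleDataset(),
    batch_size=100,
    shuffle=True,
    num_workers=8,
    # pin_memory is dropped: the batches already arrive as CUDA tensors
    collate_fn=custom_collate_fn,      # the version that calls .to('cuda')
    multiprocessing_context='spawn',   # fresh interpreter per worker, no forked CUDA state
)

# Caveats: the script needs an `if __name__ == '__main__':` guard, dataset and
# collate_fn must be picklable, and every worker holds its own CUDA context,
# which costs GPU memory and serializes access to the device.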

Also I’ve tried the solutions from here:


However, I got the error:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

That’s interesting. I expected your error to be the same as in my previously posted link, in which case the .to in your for loop should have failed too. I’m unsure where to start debugging this, and unfortunately it’s late where I am… I’ll pick this up tomorrow; if you’re still at it, do tell.

Have you tried reading up on this and the discussions in the commits that follow? It may provide some insight, if context is all you need.

Thanks for your help! Yes, I’m still working on the problem. It looks like I’d have to share the CUDA context of the main process with the spawned workers, but I don’t know how to do that.

I do recommend posting something on this thread. The devs seem to be taking quite an interest in it and they’ve been active there.

Shubh