RuntimeError: Caught RuntimeError in DataLoader worker process 0

neilsambhu · November 27, 2022, 6:46pm

I’m trying to use a DataLoader whose dataset performs some operations (neural rendering) on the GPU and outputs the result. Iterating over it throws “RuntimeError: Caught RuntimeError in DataLoader worker process 0.” Setting num_workers to 0 did not help: the code then produced no output at all (no print statements, no error messages). Here is the code I’m referencing: DualAttentionAttack/data_loader.py at main · nlsde-safety-team/DualAttentionAttack · GitHub

srishti-git1110 · November 28, 2022, 4:24am

Hi,
Please post a minimal executable snippet that reproduces your error, enclosed within ```.

Such errors are sometimes also caused by sampling. Make sure the indices generated by your Sampler are aligned with your dataset’s indices.
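
A minimal sketch of such an index check, assuming a hypothetical TinyDataset and a hypothetical helper out_of_range_indices (neither is from the linked repository). It relies only on the fact that a sampler is an iterable of integer indices:

```python
from torch.utils.data import Dataset, SequentialSampler


class TinyDataset(Dataset):
    """Hypothetical stand-in dataset holding n integers."""

    def __init__(self, n):
        self.items = list(range(n))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


def out_of_range_indices(sampler, dataset):
    """Return every index the sampler yields that the dataset cannot serve."""
    n = len(dataset)
    return [i for i in sampler if not 0 <= i < n]


dataset = TinyDataset(5)
good_sampler = SequentialSampler(dataset)  # yields 0..4
bad_sampler = [0, 4, 5]                    # a plain iterable of indices; 5 is out of range

print(out_of_range_indices(good_sampler, dataset))
print(out_of_range_indices(bad_sampler, dataset))
```

An empty list means every sampled index is valid for the dataset.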

neilsambhu · November 28, 2022, 4:59pm

Here is a minimal version of the dataloader setup:

import glob
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image
class MyDataset(Dataset):
  def __init__(self):
    self.image_paths = glob.glob('./dataset/**.png')
    # a torch.nn.Module-like model
    self.renderer = NeuralRenderer(img_size=(225,225)).cuda()
  def __getitem__(self, index):
    img = read_image(self.image_paths[index]).cuda()
    return self.renderer(img)
dataset = MyDataset()
loader = DataLoader(dataset=dataset, batch_size=1, shuffle=False, num_workers=0)

I see the runtime error when iterating over the loader.

srishti-git1110 · November 29, 2022, 5:48am

Please ensure image_paths isn’t empty and that you can retrieve files by indexing it, e.g. image_paths[0].

Please post here if you still face the error.
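
One way to run that check, as a rough sketch: the helper list_pngs and the throwaway directory are assumptions for illustration, not part of the original code. Note that without recursive=True, glob treats ** like a single *, so the pattern in the snippet above only matches files directly inside ./dataset.

```python
import glob
import os
import tempfile


def list_pngs(root):
    """Return sorted PNG paths directly under root; fail loudly if there are none."""
    paths = sorted(glob.glob(os.path.join(root, "*.png")))
    if not paths:
        raise FileNotFoundError(f"no .png files found in {root!r}")
    return paths


# Demonstrate on a temporary directory containing two empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.png", "b.png"):
        open(os.path.join(root, name), "wb").close()
    image_paths = list_pngs(root)
    first_name = os.path.basename(image_paths[0])

print(len(image_paths), first_name)
```

Raising early in this way turns a silently empty dataset into an explicit error before the DataLoader is ever built.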

neilsambhu · November 29, 2022, 5:18pm

I’ve ensured the path isn’t empty.

neilsambhu · November 29, 2022, 5:28pm

I’m able to access the images at any given index.

srishti-git1110 · November 30, 2022, 1:47am

Please implement the __len__() function in your MyDataset class.
(This function just returns the length of your dataset and is expected by many implementations of the Sampler class and the DataLoader.)
If it still doesn’t work, please show the full error message that you are getting.

Maybe this could help:

class MyDataset(Dataset):
  def __init__(self):
    ...  # body
  def __getitem__(self, index):
    ...  # body
  def __len__(self):
    return len(self.image_paths)

neilsambhu · November 30, 2022, 2:51am

Thanks for the reply. I only provided a very stripped-down version of a rather large implementation, and the full version does have __len__(). It seems like it doesn’t like that some pre-processing is happening on the GPU before the DataLoader can collate the dataset into batches?

Traceback (most recent call last):
  File "train.py", line 373, in <module>
    run_cam(test_dir, EPOCH)
  File "train.py", line 279, in run_cam
    for i,j in enumerate(loader):
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nsambhu/github/DualAttentionAttack/src/data_loader.py", line 87, in __getitem__
    imgs_pred = self.mask_renderer.forward(self.vertices_var, self.faces_var, self.textures)
  File "/home/nsambhu/github/DualAttentionAttack/src/nmr_test.py", line 249, in forward
    return self.RenderFunc(vertices, faces, textures)
  File "/home/nsambhu/github/DualAttentionAttack/src/nmr_test.py", line 167, in forward
    vs = vertices.cuda().numpy() # 11/27/2022 1:28:46 PM: Neil added
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
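
The last line of the traceback names the 'spawn' start method. A minimal sketch of selecting it for the worker processes, assuming DataLoader's multiprocessing_context argument and a hypothetical SquaresDataset stand-in (not the dataset from the repository):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical stand-in; holds no CUDA state, so it is cheap to send to workers."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return torch.tensor([float(index) ** 2])


def build_loader(num_workers=2):
    # Workers are started with 'spawn' instead of the default 'fork' on Linux.
    # No worker process is created until the loader is iterated.
    return DataLoader(
        SquaresDataset(8),
        batch_size=4,
        num_workers=num_workers,
        multiprocessing_context="spawn",
    )


loader = build_loader()
print(len(loader))
```

Spawned workers re-import the main module, so the loop over the loader should sit under an if __name__ == "__main__": guard, and the dataset must be picklable. This is only one option; keeping CUDA out of the dataset entirely is usually simpler.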

srishti-git1110 · November 30, 2022, 5:49am

In that case, have you tried removing the .cuda() calls from the __init__ and __getitem__ methods? Does it work then?

neilsambhu · November 30, 2022, 5:00pm

Unfortunately, it still fails:

Traceback (most recent call last):
  File "train.py", line 373, in <module>
    run_cam(test_dir, EPOCH)
  File "train.py", line 279, in run_cam
    for i,j in enumerate(loader):
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nsambhu/github/DualAttentionAttack/src/data_loader.py", line 97, in __getitem__
    imgs_pred = self.mask_renderer.forward(self.vertices_var, self.faces_var, self.textures)
  File "/home/nsambhu/github/DualAttentionAttack/src/nmr_test.py", line 249, in forward
    return self.RenderFunc(vertices, faces, textures)
  File "/home/nsambhu/github/DualAttentionAttack/src/nmr_test.py", line 167, in forward
    vs = vertices.cuda().numpy() # 11/27/2022 1:28:46 PM: Neil added
  File "/home/nsambhu/.conda/envs/dualattentionattack/lib/python3.7/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

ptrblck · November 30, 2022, 8:10pm

Remove all CUDA calls from your Dataset.__getitem__ method, as @srishti-git1110 already mentioned. The current code fails while trying to re-initialize the CUDA context in a new process, since you are moving a tensor to the GPU in:

vs = vertices.cuda().numpy()

and directly afterwards back to the CPU, which also seems unnecessary.
If you want to use a numpy array, just get rid of the .cuda() call to avoid moving the tensor back and forth.
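
A minimal sketch of that pattern, assuming a hypothetical CpuOnlyDataset in place of the real renderer-backed dataset: __getitem__ stays on the CPU, and the batch is moved to the GPU (if one is available) in the main process.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CpuOnlyDataset(Dataset):
    """Hypothetical stand-in: returns CPU tensors and never touches CUDA."""

    def __init__(self, n):
        self.vertices = torch.arange(n * 3, dtype=torch.float32).reshape(n, 3)

    def __len__(self):
        return self.vertices.shape[0]

    def __getitem__(self, index):
        # .numpy() works directly on a CPU tensor; no .cuda() round trip is needed.
        vs = self.vertices[index].numpy()
        return torch.from_numpy(vs * 2.0)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = DataLoader(CpuOnlyDataset(4), batch_size=2, shuffle=False, num_workers=0)

batches = []
for batch in loader:
    batch = batch.to(device)  # the GPU transfer happens here, in the main process
    batches.append(batch)

print(len(batches), tuple(batches[0].shape))
```

Because the dataset itself never initializes CUDA, the same class should also work with num_workers greater than 0 under the default fork start method.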

Yami · March 14, 2024, 1:21pm

Removing the .cuda() works for me! Thanks!