Error when running tensor.to(device) in dataset.__getitem__()

Hello,

I am currently working on a 3D video processing program, and I have run into a performance issue with the PyTorch Dataset and DataLoader. After some benchmarking, I found that a significant amount of time is spent waiting on img.to(device) in the main loop. To speed up processing, I experimented with moving the img.to(device) work into dataset.__getitem__.
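For context, here is a minimal sketch of the original main loop that the benchmark pointed at (process_frame() is a placeholder name, not a function from my actual code; loader is a standard DataLoader over the dataset shown below):

import torch

device = torch.device('cuda')

for batch in loader:
    # These host-to-device copies are where most of the time was spent
    img = batch['img'].to(device)
    depth = batch['depth'].to(device)
    process_frame(img, depth)  # placeholder for the actual 3D processing step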

The offloading approach seemed to work well and gave a nice speed improvement. However, I frequently hit CUDA error: mapping of buffer object failed during processing, and I have been unable to debug it. Here's a code snippet of what I was trying to do:

import glob

from torch.utils.data import Dataset


class FrameDepthFloDataset(Dataset):
    def __init__(self, src_in_files, depth_dir, device):
        self.img_files = sorted(src_in_files)
        self.depth_files = sorted(glob.iglob(depth_dir + '/*'))
        self.device = device

    def __getitem__(self, index):
        img_file = self.img_files[index]
        depth_file = self.depth_files[index]
        # Custom helpers that load a frame / depth map as a CPU tensor
        depth = load_torch_gray_img(depth_file)
        img = load_torch_image(img_file)
        # Move the data to the GPU before returning it to the main loop
        depth = depth.to(self.device)
        img = img.to(self.device)
        return {'img_file': img_file, 'img': img, 'depth': depth}

    def __len__(self):
        return len(self.img_files)

And the exception:

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\multiprocessing\queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\Users\richa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\pypoetry\Cache\virtualenvs\3dengine-4gtTLmD6-py3.9\lib\site-packages\torch\multiprocessing\reductions.py", line 261, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "C:\Users\richa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\pypoetry\Cache\virtualenvs\3dengine-4gtTLmD6-py3.9\lib\site-packages\torch\storage.py", line 920, in _share_cuda_
    return self._untyped_storage._share_cuda_(*args, **kwargs)
RuntimeError: CUDA error: mapping of buffer object failed
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The DataLoader is configured with batch_size=1 and num_workers=4. Unfortunately, running with CUDA_LAUNCH_BLOCKING=1 produced the same error.
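For completeness, a sketch of how the loader is constructed (src_in_files and depth_dir are the same inputs passed to __init__ above):

from torch.utils.data import DataLoader

dataset = FrameDepthFloDataset(src_in_files, depth_dir, device)
loader = DataLoader(dataset, batch_size=1, num_workers=4)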

I would appreciate help troubleshooting this issue. I cannot work out what CUDA error: mapping of buffer object failed means, or why the error does not occur deterministically. Is it legitimate to move data onto the GPU inside the data loader? My program runs on Windows 10 with PyTorch 2.0 and CUDA 11.8.

Thank you very much. Any help is greatly appreciated!


Same issue in a different context. Do you define the number of workers anywhere?

Yes, I used num_workers=4 for the DataLoader. I need the extra workers to load the data efficiently.