Hello,
I am currently working on a 3D video processing program, and I have run into performance issues with the PyTorch Dataset and DataLoader. After some benchmarking, I found that a significant amount of time is spent waiting on `img.to(device)` in the main loop. To speed things up, I experimented with moving the `img.to(device)` work into `Dataset.__getitem__`.
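For reference, this is roughly how I measured the transfer time (the tensor size here is illustrative, not my actual frame size; the `synchronize` call matters because `.to(device)` returns before the copy has actually finished):

```python
import time

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
img = torch.rand(3, 1080, 1920)  # stand-in for a loaded video frame

start = time.perf_counter()
img_gpu = img.to(device)
if device.type == 'cuda':
    # .to() is asynchronous on CUDA; wait for the copy to complete
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'transfer took {elapsed * 1000:.2f} ms')
```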
This approach seemed to work well, and I observed a nice speed improvement. However, I frequently hit a `CUDA error: mapping of buffer object failed` during processing, which I have been unable to debug. Here's a code snippet of what I was trying to do:
```python
import glob

from torch.utils.data import Dataset


class FrameDepthFloDataset(Dataset):
    def __init__(self, src_in_files, depth_dir, device):
        self.img_files = sorted(src_in_files)
        self.depth_files = sorted(glob.iglob(depth_dir + '/*'))
        self.device = device

    def __getitem__(self, index):
        img_file = self.img_files[index]
        depth_file = self.depth_files[index]
        depth = load_torch_gray_img(depth_file)
        img = load_torch_image(img_file)
        depth = depth.to(self.device)  # Here move data to GPU before returning
        img = img.to(self.device)      # Here move data to GPU before returning
        return {'img_file': img_file, 'img': img, 'depth': depth}

    def __len__(self):
        return len(self.img_files)
```
And the exception:
```
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\multiprocessing\queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\Users\richa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\pypoetry\Cache\virtualenvs\3dengine-4gtTLmD6-py3.9\lib\site-packages\torch\multiprocessing\reductions.py", line 261, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "C:\Users\richa\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\Local\pypoetry\Cache\virtualenvs\3dengine-4gtTLmD6-py3.9\lib\site-packages\torch\storage.py", line 920, in _share_cuda_
    return self._untyped_storage._share_cuda_(*args, **kwargs)
RuntimeError: CUDA error: mapping of buffer object failed
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
The data loader is configured with `batch_size=1` and `num_workers=4`. Unfortunately, running with `CUDA_LAUNCH_BLOCKING=1` showed the same error.
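For comparison, the conventional setup I understand to be the alternative keeps `__getitem__` on the CPU and does the device transfer in the main loop. Here is a minimal sketch with a dummy stand-in dataset (the dataset class and tensor sizes are illustrative, not my real ones); `pin_memory=True` plus `non_blocking=True` is meant to speed up the host-to-device copies:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class DummyFrameDataset(Dataset):
    """Stand-in for FrameDepthFloDataset: returns CPU tensors only."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # CPU tensors here; the GPU transfer happens in the main loop instead
        return {'img': torch.rand(3, 64, 64), 'depth': torch.rand(1, 64, 64)}

    def __len__(self):
        return self.n


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loader = DataLoader(
    DummyFrameDataset(8),
    batch_size=1,
    num_workers=0,  # 0 so the sketch runs anywhere; my real loader uses 4
    pin_memory=torch.cuda.is_available(),  # pinned host memory for faster copies
)
for batch in loader:
    # non_blocking=True lets the copy overlap with compute when memory is pinned
    img = batch['img'].to(device, non_blocking=True)
    depth = batch['depth'].to(device, non_blocking=True)
```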
I would like to ask for help troubleshooting this issue. I cannot work out what `CUDA error: mapping of buffer object failed` means, or why the error does not occur deterministically. Is it legitimate to load data onto the GPU within the data loader? My program is running on Windows 10 with PyTorch 2.0 and CUDA 11.8.
Thank you very much. Any help is greatly appreciated!