Inter-process sharing of a CUDA tensor

As my experiments show that training results are better with a larger batch size, I want to use as large a batch size as possible.

And my dataset is a time series; every sample is a sliding-window view of the series, so there is a lot of duplicated data between different samples.

Originally, I kept my dataset in CPU memory (as a tensor or NumPy array) and used the DataLoader's default sampling/collating functions to build each batch naturally.

But a lot of CUDA memory is wasted (pin_memory=True), so the batch size can NOT be large enough, especially when I run with many data-loading workers.

Is there a way to make a CUDA tensor shareable across multiple processes? I ask because the data-loading workers run in different processes from the main process.

If so, I could move the entire dataset to CUDA before training. The workers would then operate on the same CUDA tensor as the main process, and send only views (instead of copies) back to the main process.
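For the sliding-window part, the views-instead-of-copies idea already works on a single tensor via `Tensor.unfold`, which returns a strided view over the same storage. A minimal sketch (the names `series` and `window` are illustrative, not from the post):

```python
import torch

# Hypothetical sliding-window "dataset": views over one backing tensor,
# so the overlapping samples share memory instead of duplicating it.
series = torch.arange(10.0)   # the full time series
window = 4

# unfold(dim, size, step) returns a view: row i is series[i : i + window]
windows = series.unfold(0, window, 1)

print(tuple(windows.shape))                      # (7, 4)
# Same underlying storage -- no data duplicated between samples:
print(windows.data_ptr() == series.data_ptr())   # True
```

The same call works on a CUDA tensor, so each "sample" stays a zero-copy view of the dataset resident on the GPU.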

Thanks!

Forgive my English.

Are you running out of CUDA/GPU memory, or out of CPU memory?

Pinned memory (pin_memory=True) always lives on the CPU in PyTorch, so it sounds like you are running out of CPU memory. Is that correct?

CUDA memory supports IPC handles, which allow sharing between different processes. In PyTorch, when using multiprocessing, IPC handles are created automatically under the hood; see e.g. Using CUDA IPC memory handles in pytorch.
This means you can simply send these tensors to other Python processes using standard multiprocessing Queues.

But beware that memory management can be a bit tricky with this, as the producer (sender) still owns the data; see also this doc for reference: pytorch/torch/multiprocessing/cuda_multiprocessing.md at main · pytorch/pytorch · GitHub
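A minimal sketch of the Queue-based sharing described above (assuming a CUDA-capable machine; the function and variable names are illustrative). Note the producer keeps `src` alive until the consumer has finished, which is exactly the ownership caveat mentioned:

```python
import torch
import torch.multiprocessing as mp

def consumer(q):
    # Receives the CUDA tensor via an IPC handle -- no copy of the data.
    t = q.get()
    # Writes into the shared GPU memory; the producer sees the change.
    t[0] = 42.0

if __name__ == "__main__":
    # Sharing CUDA tensors requires the 'spawn' (or 'forkserver') start method.
    mp.set_start_method("spawn", force=True)
    if torch.cuda.is_available():
        src = torch.zeros(4, device="cuda")
        q = mp.Queue()
        p = mp.Process(target=consumer, args=(q,))
        p.start()
        q.put(src)   # producer still owns `src`; keep the reference alive
        p.join()     # wait for the consumer before releasing the tensor
        print(src[0].item())  # expected 42.0 -- the child wrote to shared memory
```

Use torch.multiprocessing (not the plain multiprocessing module) so the tensor sharing machinery is wired in, and make sure the receiving process is still running while it uses the tensor.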
