My experiments show that training results improve with a larger batch size, so I want to use the largest batch size possible.
My dataset is a time series, and every sample is a sliding-window view of the series, so there is a lot of duplicated data between different samples.
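To illustrate what I mean by sliding windows sharing data (a toy sketch; the window size here is made up), the windows can even be built as zero-copy views, so the duplication exists only logically, not in memory:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=np.float32)  # toy 1-D time series

# Each sample is a window of length 4; consecutive windows overlap in 3 values.
windows = sliding_window_view(series, window_shape=4)  # shape (7, 4)

# The windows are views into `series`: the overlapping data is not copied.
assert windows.shape == (7, 4)
assert np.shares_memory(windows, series)
```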
Originally, I kept the dataset in CPU memory (as a tensor or NumPy array) and used the DataLoader's default sampling/collating functions to build each batch.
But this wastes a lot of CUDA memory (with pin_memory=True), so the batch size cannot be as large as I want, especially when I run with many data-loading workers.
Is there a way to share a CUDA tensor across processes? The data-loading workers run in different processes from the main process.
If so, I could move the entire dataset to CUDA before training. The workers would then operate on the same CUDA tensor as the main process and send only views (instead of copies) back to it.
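As a possible workaround when the dataset fits in GPU memory (a sketch under that assumption, not a definitive answer to the sharing question): skip the DataLoader workers entirely, keep the series on the device, and slice batches on-device with `unfold` views plus index tensors. All names and sizes below are hypothetical; the code falls back to CPU when no GPU is present.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical data: the whole series lives on `device` exactly once.
series = torch.randn(100_000, device=device)
window, batch_size = 64, 256
n_windows = series.numel() - window + 1


def iter_batches():
    """Yield shuffled batches of sliding windows, built on-device.

    `unfold` produces zero-copy views into `series`; only the
    batch-sized gather below allocates new device memory, so no
    pinned host memory or host->device transfer is involved.
    """
    windows = series.unfold(0, window, 1)  # (n_windows, window) view
    perm = torch.randperm(n_windows, device=device)
    for start in range(0, n_windows, batch_size):
        yield windows[perm[start:start + batch_size]]


batch = next(iter_batches())
assert batch.shape == (batch_size, window)
assert batch.device == series.device
```

This avoids the multi-process question altogether; if worker processes are really needed, my understanding is that `torch.multiprocessing` can share CUDA tensors between processes (with the spawn start method), but I have not verified that it works inside DataLoader workers.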
Thanks!