How to use data previously sent to device in DataParallel module

I’m in a situation where the dataset is too big to load on a single GPU, but can be split across multiple GPUs on the machine.

In this case, I want to train a model wrapped in DataParallel across these GPUs, using the data that already resides on each device.
The problem is that DataParallel splits its input along the batch dimension and scatters the chunks to the model replica on each device.
This is where the issue arises: my data is already on the devices, but DataParallel expects to do the scattering itself.
One workaround I can think of is to move the data already on each device back to the CPU (or to a single GPU) and then scatter it out to each GPU again, but this is inefficient.
Running on a small dataset, the model was able to process 100,000 samples (about 200 GB) per second; with the added round-trip through the CPU, it is roughly 50x slower.
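To make the bottleneck concrete, here is a minimal sketch of the round-trip I mean (the shard sizes and the `nn.Linear` model are just placeholders; CPU tensors stand in for the per-GPU shards when no GPU is available):

```python
import torch
import torch.nn as nn

n_gpus = max(torch.cuda.device_count(), 1)

# Simulated per-device shards; in my real setup each shard already
# lives on its own GPU and is too big to fit on a single one.
shards = [torch.randn(32, 16) for _ in range(n_gpus)]

# The costly step: gather every shard back onto the CPU...
batch = torch.cat([s.cpu() for s in shards], dim=0)

model = nn.Linear(16, 4)
if torch.cuda.is_available():
    # ...only for DataParallel to scatter it along dim 0
    # right back to the devices the data came from.
    model = nn.DataParallel(model.cuda())
    batch = batch.cuda()

out = model(batch)
print(tuple(out.shape))  # (32 * n_gpus, 4)
```

The gather-then-rescatter copies are pure overhead, since every chunk ends up back on the device it started on.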

Are there any DataParallel tricks suited to my situation? Or would I be better off with a multithreading/multiprocessing approach?